Reputation: 580
Say I have the following pd.DataFrame (df.to_dict()):
eff_timestamp val id begin_timestamp end_timestamp
0 2021-01-01 00:00:00 0.677085 1 2021-01-01 02:00:00 2021-01-01 05:30:00
1 2021-01-01 01:00:00 -0.356381 1 2021-01-01 02:00:00 2021-01-01 05:30:00
2 2021-01-01 02:00:00 1.697311 1 2021-01-01 02:00:00 2021-01-01 05:30:00
3 2021-01-01 03:00:00 0.910820 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 -1.024458 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.430950 1 2021-01-01 02:00:00 2021-01-01 05:30:00
6 2021-01-01 06:00:00 -1.124934 1 2021-01-01 02:00:00 2021-01-01 05:30:00
7 2021-01-01 07:00:00 0.791751 1 2021-01-01 02:00:00 2021-01-01 05:30:00
8 2021-01-02 00:00:00 0.629035 2 2021-01-02 02:00:00 2021-01-02 05:30:00
9 2021-01-02 01:00:00 0.445033 2 2021-01-02 02:00:00 2021-01-02 05:30:00
10 2021-01-02 02:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
14 2021-01-02 06:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
15 2021-01-02 07:00:00 -0.637133 2 2021-01-02 02:00:00 2021-01-02 05:30:00
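For anyone who wants to reproduce the frame, it can be rebuilt from the printed table. This is a minimal sketch, assuming hourly rows and the val figures exactly as printed, so the values are six-decimal roundings of the original data. The helper name day is only illustrative.

```python
import numpy as np
import pandas as pd

# One calendar day per row: eight rows for id 1, eight rows for id 2
day = pd.to_datetime(["2021-01-01"] * 8 + ["2021-01-02"] * 8)

df = pd.DataFrame({
    # hourly timestamps 00:00 .. 07:00 within each day
    "eff_timestamp": day + pd.to_timedelta(list(range(8)) * 2, unit="h"),
    # values copied from the printed table (rounded to six decimals)
    "val": [0.677085, -0.356381, 1.697311, 0.910820,
            -1.024458, -0.430950, -1.124934, 0.791751,
            0.629035, 0.445033, np.nan, np.nan,
            np.nan, np.nan, np.nan, -0.637133],
    "id": [1] * 8 + [2] * 8,
    "begin_timestamp": day + pd.Timedelta(hours=2),
    "end_timestamp": day + pd.Timedelta(hours=5, minutes=30),
})

print(df)
```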
I would like to get one value per unique id that holds the mean of val for the period between begin_timestamp and end_timestamp (based on eff_timestamp). If that mean is np.nan, I would like to get the last available value that is not np.nan instead.
I know how to get the mean of 'val' between the begin and end timestamp:
sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
eff_timestamp val id begin_timestamp end_timestamp
3 2021-01-01 03:00:00 0.910820 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 -1.024458 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.430950 1 2021-01-01 02:00:00 2021-01-01 05:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 02:00:00 2021-01-02 05:30:00
sliced.groupby('id').val.mean()
>>>
id
1 -0.181529
2 NaN
Name: val, dtype: float64
Because id=2 only has NaN values between 2021-01-02 02:00:00 and 2021-01-02 05:30:00, this returns NaN. However, in that case I'd like to get the value 0.445033, because that's the last non-NaN value for that patient. How would I do this? The output should thus be:
id
1 -0.181529
2 0.445033
Name: val, dtype: float64
Upvotes: 1
Views: 92
Reputation: 5696
Use the rows that are left out when you build sliced above. Call that dataframe left_out:
# just the negation of what you've used
left_out = df[~((df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp))]
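Then, as per the comment above, keep only the rows that fall before begin_timestamp and still have a value. Filtering before sorting avoids indexing the sorted frame with a mask built from the unsorted one:
values_before_the_period = left_out[left_out['eff_timestamp'] < left_out['begin_timestamp']].dropna(subset=['val']).sort_values(["eff_timestamp", "begin_timestamp"])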
Note that sort_values may not be necessary if the data is already sorted.
With this, we can obtain a mapping that contains the last non-nan value before the period between begin_timestamp and end_timestamp.
mapping = values_before_the_period.groupby('id').tail(1)[['id', 'val']].set_index('id').to_dict()
# {'val': {1: -0.3563813741494545, 2: 0.445032587866597}}
Make a series out of that:
mapping_s = pd.Series(mapping['val'])
mapping_s
1 -0.356381
2 0.445033
dtype: float64
Now this can be combined with s, the per-id means you obtained from sliced.groupby('id').val.mean(), so that only the np.nan values in s are replaced:
s.combine_first(mapping_s)
1 -0.181529
2 0.445033
Name: val, dtype: float64
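Put together, the steps above can be sketched end to end as follows. This is a sketch under assumptions: the frame is rebuilt from the rounded values printed in the question, NaN rows are dropped explicitly before taking the last row per id, and the names day, in_window, before and fallback are illustrative rather than part of the original code.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame from its printed (rounded) values
day = pd.to_datetime(["2021-01-01"] * 8 + ["2021-01-02"] * 8)
df = pd.DataFrame({
    "eff_timestamp": day + pd.to_timedelta(list(range(8)) * 2, unit="h"),
    "val": [0.677085, -0.356381, 1.697311, 0.910820,
            -1.024458, -0.430950, -1.124934, 0.791751,
            0.629035, 0.445033, np.nan, np.nan,
            np.nan, np.nan, np.nan, -0.637133],
    "id": [1] * 8 + [2] * 8,
    "begin_timestamp": day + pd.Timedelta(hours=2),
    "end_timestamp": day + pd.Timedelta(hours=5, minutes=30),
})

# Mean of val strictly inside (begin_timestamp, end_timestamp), per id
in_window = (df["eff_timestamp"] > df["begin_timestamp"]) & (
    df["eff_timestamp"] < df["end_timestamp"]
)
s = df[in_window].groupby("id")["val"].mean()

# Last non-NaN val recorded before begin_timestamp, per id
before = df[(df["eff_timestamp"] < df["begin_timestamp"]) & df["val"].notna()]
fallback = before.sort_values("eff_timestamp").groupby("id")["val"].last()

# Keep the mean where it exists, otherwise use the fallback value
result = s.combine_first(fallback)
print(result)
```

Because combine_first only fills positions where s is NaN (or missing), id 1 keeps its window mean and only id 2 picks up the earlier value.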
Upvotes: 1
Reputation: 2293
Continuing from your code, once you've determined that a given id (let's call it nid) has a NaN mean, we can determine the index where its slice begins:
slice_start = sliced[sliced['id'] == nid].index[0]
Now we can get the portion of the 'val' series that comes before the slice:
portion = df.loc[:slice_start, 'val'][df['id'] == nid]
and use last_valid_index() to get the value you want:
val = df.loc[portion.last_valid_index(), 'val']
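To apply this to every id whose mean is NaN, the three steps can be wrapped in a loop. A minimal sketch, assuming a default integer index and the frame rebuilt from the question's printed (rounded) values; the series name means and the combined boolean mask used to build portion are illustrative choices, not part of the code above.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame from its printed (rounded) values
day = pd.to_datetime(["2021-01-01"] * 8 + ["2021-01-02"] * 8)
df = pd.DataFrame({
    "eff_timestamp": day + pd.to_timedelta(list(range(8)) * 2, unit="h"),
    "val": [0.677085, -0.356381, 1.697311, 0.910820,
            -1.024458, -0.430950, -1.124934, 0.791751,
            0.629035, 0.445033, np.nan, np.nan,
            np.nan, np.nan, np.nan, -0.637133],
    "id": [1] * 8 + [2] * 8,
    "begin_timestamp": day + pd.Timedelta(hours=2),
    "end_timestamp": day + pd.Timedelta(hours=5, minutes=30),
})

sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
means = sliced.groupby("id")["val"].mean()

# Visit only the ids whose in-window mean came out as NaN
for nid in means[means.isna()].index:
    # label of the first in-window row for this id
    slice_start = sliced[sliced["id"] == nid].index[0]
    # this id's values up to and including that row
    portion = df.loc[(df.index <= slice_start) & (df["id"] == nid), "val"]
    last = portion.last_valid_index()  # None if no earlier value exists
    if last is not None:
        means.loc[nid] = df.loc[last, "val"]

print(means)
```

If an id has no non-NaN value before its window, last_valid_index() returns None and the mean is left as NaN.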
Upvotes: 1
Reputation: 24324
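You can try the following. idx points at the row just before each NaN row (this assumes a default integer index), and groupby(...).first() returns the first non-NaN value per id: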
idx = df[df['val'].isna()].index - 1
out = sliced.groupby('id')['val'].mean().fillna(df.loc[idx].groupby('id', sort=False)['val'].first())
Output of out:
id
1 -0.181529
2 0.445033
Name: val, dtype: float64
Upvotes: 2