sandertjuh

Reputation: 580

pd.DataFrame: get average value per id between two datetimes; if NaN, get last non-NaN value

Say I have the following pd.DataFrame (shown as printed):

    eff_timestamp       val         id  begin_timestamp     end_timestamp
0   2021-01-01 00:00:00 0.677085    1   2021-01-01 02:00:00 2021-01-01 05:30:00
1   2021-01-01 01:00:00 -0.356381   1   2021-01-01 02:00:00 2021-01-01 05:30:00
2   2021-01-01 02:00:00 1.697311    1   2021-01-01 02:00:00 2021-01-01 05:30:00
3   2021-01-01 03:00:00 0.910820    1   2021-01-01 02:00:00 2021-01-01 05:30:00
4   2021-01-01 04:00:00 -1.024458   1   2021-01-01 02:00:00 2021-01-01 05:30:00
5   2021-01-01 05:00:00 -0.430950   1   2021-01-01 02:00:00 2021-01-01 05:30:00
6   2021-01-01 06:00:00 -1.124934   1   2021-01-01 02:00:00 2021-01-01 05:30:00
7   2021-01-01 07:00:00 0.791751    1   2021-01-01 02:00:00 2021-01-01 05:30:00
8   2021-01-02 00:00:00 0.629035    2   2021-01-02 02:00:00 2021-01-02 05:30:00
9   2021-01-02 01:00:00 0.445033    2   2021-01-02 02:00:00 2021-01-02 05:30:00
10  2021-01-02 02:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00
11  2021-01-02 03:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00
12  2021-01-02 04:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00
13  2021-01-02 05:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00
14  2021-01-02 06:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00
15  2021-01-02 07:00:00 -0.637133   2   2021-01-02 02:00:00 2021-01-02 05:30:00
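
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch typed from the printed table, so val is rounded to six decimals, and the helper name stamps is an assumption, not something from the post.

```python
import numpy as np
import pandas as pd

# Hourly timestamps 00:00-07:00 on 2021-01-01 (id 1) and 2021-01-02 (id 2).
stamps = pd.to_datetime([f"2021-01-0{d} 0{h}:00" for d in (1, 2) for h in range(8)])

df = pd.DataFrame({
    "eff_timestamp": stamps,
    "val": [0.677085, -0.356381, 1.697311, 0.910820,
            -1.024458, -0.430950, -1.124934, 0.791751,
            0.629035, 0.445033, np.nan, np.nan,
            np.nan, np.nan, np.nan, -0.637133],
    "id": np.repeat([1, 2], 8),
    # Each id has a single begin/end window, repeated on every row.
    "begin_timestamp": pd.to_datetime(np.repeat(["2021-01-01 02:00", "2021-01-02 02:00"], 8)),
    "end_timestamp": pd.to_datetime(np.repeat(["2021-01-01 05:30", "2021-01-02 05:30"], 8)),
})
print(df)
```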

I would like to get one value per unique id, that holds the mean of val for the period between begin_timestamp and end_timestamp (based on eff_timestamp). If that value returns np.nan, I would like to get the last available value that is not np.nan. I know how to get the mean of 'val' between the begin and end timestamp:

sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
    eff_timestamp       val         id  begin_timestamp     end_timestamp
3   2021-01-01 03:00:00 0.910820    1   2021-01-01 02:00:00 2021-01-01 05:30:00
4   2021-01-01 04:00:00 -1.024458   1   2021-01-01 02:00:00 2021-01-01 05:30:00
5   2021-01-01 05:00:00 -0.430950   1   2021-01-01 02:00:00 2021-01-01 05:30:00
11  2021-01-02 03:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00
12  2021-01-02 04:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00
13  2021-01-02 05:00:00 NaN         2   2021-01-02 02:00:00 2021-01-02 05:30:00

sliced.groupby('id').val.mean()
>>>
id
1   -0.181529
2         NaN
Name: val, dtype: float64

Because id=2 has only NaN values between 2021-01-02 02:00:00 and 2021-01-02 05:30:00, the mean for that id is NaN. In that case I'd like to get 0.445033 instead, because that's the last non-NaN value for that id before the period. How would I do this? The output should thus be:

id
1   -0.181529
2    0.445033
Name: val, dtype: float64

Upvotes: 1

Views: 92

Answers (3)

akilat90

Reputation: 5696

Start from the rows that are left out when you build sliced above. Call that frame left_out:

# just the negation of what you've used
left_out = df[~((df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp))]  

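Then, as per the comment above, keep only the rows that fall before begin_timestamp, and drop NaN values so that the last remaining row per id is a valid one:

# filter first, then sort, so the boolean mask lines up with the frame it indexes
before = left_out[left_out['eff_timestamp'] < left_out['begin_timestamp']]
values_before_the_period = before.dropna(subset=['val']).sort_values(["eff_timestamp", "begin_timestamp"])

Note that sort_values may not be necessary if the data is already sorted by eff_timestamp.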

With this, we can obtain a mapping that contains the last non-nan value before the period between begin_timestamp and end_timestamp.

mapping = values_before_the_period.groupby('id').tail(1)[['id', 'val']].set_index('id').to_dict()

# {'val': {1: -0.3563813741494545, 2: 0.445032587866597}}

Make a series out of that:

mapping_s = pd.Series(mapping['val'])
mapping_s

1   -0.356381
2    0.445033
dtype: float64

Now this can be combined with s, the per-id mean you've already obtained (s = sliced.groupby('id').val.mean()), so that only the np.nan values in s are replaced.

s.combine_first(mapping_s)

1   -0.181529
2    0.445033
Name: val, dtype: float64
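
Put together, the steps above can be sketched end to end as below. This is a minimal version under a few assumptions: the frame is rebuilt from the printed values in the question (rounded to six decimals), the names in_window and before are mine, and groupby(...).last() is swapped in for the tail(1)/to_dict round trip.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame from its printed values (rounded).
stamps = pd.to_datetime([f"2021-01-0{d} 0{h}:00" for d in (1, 2) for h in range(8)])
df = pd.DataFrame({
    "eff_timestamp": stamps,
    "val": [0.677085, -0.356381, 1.697311, 0.910820,
            -1.024458, -0.430950, -1.124934, 0.791751,
            0.629035, 0.445033, np.nan, np.nan,
            np.nan, np.nan, np.nan, -0.637133],
    "id": np.repeat([1, 2], 8),
    "begin_timestamp": pd.to_datetime(np.repeat(["2021-01-01 02:00", "2021-01-02 02:00"], 8)),
    "end_timestamp": pd.to_datetime(np.repeat(["2021-01-01 05:30", "2021-01-02 05:30"], 8)),
})

# Per-id mean inside the window (strict inequalities, as in the question).
in_window = (df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)
s = df[in_window].groupby("id").val.mean()

# Last non-NaN value per id strictly before the window starts.
before = df[df.eff_timestamp < df.begin_timestamp].dropna(subset=["val"])
mapping_s = before.sort_values("eff_timestamp").groupby("id").val.last()

# Only the NaN entries of s are filled from mapping_s.
result = s.combine_first(mapping_s)
print(result)
```

One thing to keep in mind: combine_first takes the union of both indexes, so an id with no rows at all inside its window would still show up in result with its fallback value.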

Upvotes: 1

joao

Reputation: 2293

Continuing from your code, once you've determined that for a given id (let's call it nid) you have a NaN value, we can determine the index where the slice begins:

slice_start = sliced[sliced['id'] == nid].index[0]

Now we can get the portion of the 'val' series that comes before the slice:

portion = df.loc[:slice_start, 'val'][df['id'] == nid]

and use last_valid_index() to get the value you want:

val = df.loc[portion.last_valid_index(), 'val']
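
To apply those three steps to every id whose mean came out as NaN, they could be wrapped in a loop roughly like this. It is a sketch that assumes df keeps its default integer index and is ordered by eff_timestamp within each id; the frame is rebuilt from the question's printed values, and the names means and rows are mine.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame from its printed values (rounded).
stamps = pd.to_datetime([f"2021-01-0{d} 0{h}:00" for d in (1, 2) for h in range(8)])
df = pd.DataFrame({
    "eff_timestamp": stamps,
    "val": [0.677085, -0.356381, 1.697311, 0.910820,
            -1.024458, -0.430950, -1.124934, 0.791751,
            0.629035, 0.445033, np.nan, np.nan,
            np.nan, np.nan, np.nan, -0.637133],
    "id": np.repeat([1, 2], 8),
    "begin_timestamp": pd.to_datetime(np.repeat(["2021-01-01 02:00", "2021-01-02 02:00"], 8)),
    "end_timestamp": pd.to_datetime(np.repeat(["2021-01-01 05:30", "2021-01-02 05:30"], 8)),
})

sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
means = sliced.groupby("id")["val"].mean()

# Visit only the ids whose windowed mean is NaN.
for nid in list(means[means.isna()].index):
    # Label of the first row of this id's slice.
    slice_start = sliced[sliced["id"] == nid].index[0]
    # Rows of this id up to (and including) the start of the slice.
    rows = df.loc[:slice_start]
    portion = rows.loc[rows["id"] == nid, "val"]
    last = portion.last_valid_index()
    if last is not None:  # None means there is no earlier non-NaN value
        means.loc[nid] = df.loc[last, "val"]

print(means)
```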

Upvotes: 1

Anurag Dabas

Reputation: 24324

You can try:

idx = df[df['val'].isna()].index - 1

out = sliced.groupby('id')['val'].mean().fillna(df.loc[idx].groupby('id', sort=False)['val'].first())

Output of out:

id
1   -0.181529
2    0.445033
Name: val, dtype: float64
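
Note that index - 1 relies on a default integer index in which the row just above a NaN belongs to the same id. If that can't be guaranteed, a different technique (not the one above) is to forward-fill val within each id and take the last filled value inside the window. A rough sketch, with the frame rebuilt from the question's printed values and the names in_window, filled and fallback being my own:

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame from its printed values (rounded).
stamps = pd.to_datetime([f"2021-01-0{d} 0{h}:00" for d in (1, 2) for h in range(8)])
df = pd.DataFrame({
    "eff_timestamp": stamps,
    "val": [0.677085, -0.356381, 1.697311, 0.910820,
            -1.024458, -0.430950, -1.124934, 0.791751,
            0.629035, 0.445033, np.nan, np.nan,
            np.nan, np.nan, np.nan, -0.637133],
    "id": np.repeat([1, 2], 8),
    "begin_timestamp": pd.to_datetime(np.repeat(["2021-01-01 02:00", "2021-01-02 02:00"], 8)),
    "end_timestamp": pd.to_datetime(np.repeat(["2021-01-01 05:30", "2021-01-02 05:30"], 8)),
})

in_window = (df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)
means = df[in_window].groupby("id")["val"].mean()

# Forward-fill within each id, in time order, so NaN rows carry the
# most recent earlier non-NaN value of the same id.
filled = df.sort_values(["id", "eff_timestamp"]).groupby("id")["val"].ffill()

# Last filled value inside each id's window.
fallback = filled[in_window].groupby(df.loc[in_window, "id"]).last()

# Keep the mean where it exists, otherwise use the fallback.
out = means.fillna(fallback)
print(out)
```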

Upvotes: 2
