Reputation: 1256
I have a DataFrame where I have 1000s of rows and 100s of column where I want to forwardfill the data but grouped by id and original data ( date range). What I mean by original data is if we have a data for id 1 for date 01/01/2020 but null value for date 01/05/2020, 02/02/2020, I would like to fill the data on 01/05/2020 but not 02/02/2020 since 02/02/2020 is not within 30 days period. When we ffill, it fills all data based on last result.
import pandas as pd
import numpy as np
res= pd.DataFrame({'id':[1,1,1,1,1,2,2],
'date':['01/01/2020','01/05/2020','02/03/2020','02/05/2020','04/01/2020','01/01/2020','01/02/2020'],
'result':[1.5,np.nan,np.nan,2.6,np.nan,np.nan,6.0]})
res['result1']= res.groupby(['id']).apply(lambda x: x.result.ffill()).reset_index(drop=True)
result I get is:
id date result result1
0 1 01/01/2020 1.5 1.5
1 1 01/05/2020 NaN 1.5
2 1 02/03/2020 NaN 1.5
3 1 02/05/2020 2.6 2.6
4 1 04/01/2020 NaN 2.6
5 2 01/01/2020 NaN NaN
6 2 01/02/2020 6.0 6.0
What I want is :
id date result result1
0 1 01/01/2020 1.5 1.5
1 1 01/05/2020 NaN 1.5
2 1 02/03/2020 NaN NaN
3 1 02/05/2020 2.6 2.6
4 1 04/01/2020 NaN NaN
5 2 01/01/2020 NaN NaN
6 2 01/02/2020 6.0 6.0
Upvotes: 4
Views: 754
Reputation: 150785
Not so elegant as Ben's merge_asof
, but you can do something like this:
res['date'] = pd.to_datetime(res['date'])
# valid blocks
valids = res['result'].notna().cumsum()
# first dates in each block
first_dates = res.groupby(['id',valids])['date'].transform('min')
# How far we ffill
mask = (res['date']-first_dates)<pd.Timedelta('30D')
# ffill and then mask
res['result1'] = res['result'].groupby(res['id']).ffill().where(mask)
Output:
id date result result1
0 1 2020-01-01 1.5 1.5
1 1 2020-01-05 NaN 1.5
2 1 2020-02-03 NaN NaN
3 1 2020-02-05 2.6 2.6
4 1 2020-04-01 NaN NaN
5 2 2020-01-01 NaN NaN
6 2 2020-01-02 6.0 6.0
Upvotes: 1
Reputation: 323316
You can try with merge_asof
res['date']=pd.to_datetime(res['date'])
res = res.sort_values('date')
res1 = res.dropna(subset=['result']).rename(columns={'result':'result1'})
out = pd.merge_asof(res.reset_index(),res1 , by ='id', on ='date',tolerance = pd.Timedelta(30, unit='d'),direction = 'backward').sort_values('index')
Out[72]:
index id date result result1
0 0 1 2020-01-01 1.5 1.5
3 1 1 2020-01-05 NaN 1.5
4 2 1 2020-02-03 NaN NaN
5 3 1 2020-02-05 2.6 2.6
6 4 1 2020-04-01 NaN NaN
1 5 2 2020-01-01 NaN NaN
2 6 2 2020-01-02 6.0 6.0
Upvotes: 3