Sam

Reputation: 1256

How can I ffill a pandas DataFrame based on id and an "original date" range?

I have a DataFrame with 1000s of rows and 100s of columns where I want to forward-fill the data, but grouped by id and the "original date" range. What I mean by original date: if we have data for id 1 on 01/01/2020 but null values on 01/05/2020 and 02/02/2020, I would like to fill the value on 01/05/2020 but not on 02/02/2020, since 02/02/2020 is not within a 30-day period of the original date. A plain ffill fills every row based on the last result, regardless of how much time has passed.

import pandas as pd
import numpy as np

res = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                    'date': ['01/01/2020', '01/05/2020', '02/03/2020', '02/05/2020',
                             '04/01/2020', '01/01/2020', '01/02/2020'],
                    'result': [1.5, np.nan, np.nan, 2.6, np.nan, np.nan, 6.0]})

res['result1'] = res.groupby('id')['result'].ffill()

The result I get is:

   id        date  result  result1
0   1  01/01/2020     1.5      1.5
1   1  01/05/2020     NaN      1.5
2   1  02/03/2020     NaN      1.5
3   1  02/05/2020     2.6      2.6
4   1  04/01/2020     NaN      2.6
5   2  01/01/2020     NaN      NaN
6   2  01/02/2020     6.0      6.0

What I want is:

   id        date  result  result1
0   1  01/01/2020     1.5      1.5
1   1  01/05/2020     NaN      1.5
2   1  02/03/2020     NaN      NaN
3   1  02/05/2020     2.6      2.6
4   1  04/01/2020     NaN      NaN
5   2  01/01/2020     NaN      NaN
6   2  01/02/2020     6.0      6.0

Upvotes: 4

Views: 754

Answers (2)

Quang Hoang

Reputation: 150785

Not as elegant as BENY's merge_asof, but you can do something like this:

res['date'] = pd.to_datetime(res['date'])

# label consecutive blocks: a new block starts at each non-null result
valids = res['result'].notna().cumsum()

# first date in each (id, block) group
first_dates = res.groupby(['id', valids])['date'].transform('min')

# only ffill while we are within 30 days of the block's first date
mask = (res['date'] - first_dates) < pd.Timedelta('30D')

# ffill per id, then blank out rows beyond the window
res['result1'] = res['result'].groupby(res['id']).ffill().where(mask)

Output:

   id       date  result  result1
0   1 2020-01-01     1.5      1.5
1   1 2020-01-05     NaN      1.5
2   1 2020-02-03     NaN      NaN
3   1 2020-02-05     2.6      2.6
4   1 2020-04-01     NaN      NaN
5   2 2020-01-01     NaN      NaN
6   2 2020-01-02     6.0      6.0
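To make the mechanics easier to follow, here is a self-contained sketch of the same steps on the sample data from the question, printing the intermediate block labels and the 30-day mask:

```python
import pandas as pd
import numpy as np

# Sample data from the question
res = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                    'date': ['01/01/2020', '01/05/2020', '02/03/2020', '02/05/2020',
                             '04/01/2020', '01/01/2020', '01/02/2020'],
                    'result': [1.5, np.nan, np.nan, 2.6, np.nan, np.nan, 6.0]})
res['date'] = pd.to_datetime(res['date'])

# Each non-null result starts a new block; cumsum labels the runs
valids = res['result'].notna().cumsum()
print(valids.tolist())   # [1, 1, 1, 2, 2, 2, 3]

# Earliest date within each (id, block) pair
first_dates = res.groupby(['id', valids])['date'].transform('min')

# Rows more than 30 days past their block's first date get masked out
mask = (res['date'] - first_dates) < pd.Timedelta('30D')
print(mask.tolist())     # [True, True, False, True, False, True, True]

res['result1'] = res['result'].groupby(res['id']).ffill().where(mask)
```

Row 2 (2020-02-03) is 33 days after its block's first date (2020-01-01), so it falls outside the window and stays NaN.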

Upvotes: 1

BENY

Reputation: 323316

You can try with merge_asof:

res['date'] = pd.to_datetime(res['date'])
res = res.sort_values('date')  # merge_asof requires the 'on' key to be sorted
res1 = res.dropna(subset=['result']).rename(columns={'result': 'result1'})
out = pd.merge_asof(res.reset_index(), res1, by='id', on='date',
                    tolerance=pd.Timedelta(30, unit='d'),
                    direction='backward').sort_values('index')
Out[72]: 
   index  id       date  result  result1
0      0   1 2020-01-01     1.5      1.5
3      1   1 2020-01-05     NaN      1.5
4      2   1 2020-02-03     NaN      NaN
5      3   1 2020-02-05     2.6      2.6
6      4   1 2020-04-01     NaN      NaN
1      5   2 2020-01-01     NaN      NaN
2      6   2 2020-01-02     6.0      6.0
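For reference, a self-contained version of this approach: merge_asof matches each row to the most recent earlier non-null result for the same id, but only within the 30-day tolerance, so anything older stays NaN. (Variable names follow the answer; dropping the helper index column at the end is an addition.)

```python
import pandas as pd
import numpy as np

res = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                    'date': ['01/01/2020', '01/05/2020', '02/03/2020', '02/05/2020',
                             '04/01/2020', '01/01/2020', '01/02/2020'],
                    'result': [1.5, np.nan, np.nan, 2.6, np.nan, np.nan, 6.0]})
res['date'] = pd.to_datetime(res['date'])

res = res.sort_values('date')  # merge_asof needs both frames sorted on 'date'
res1 = res.dropna(subset=['result']).rename(columns={'result': 'result1'})

out = pd.merge_asof(res.reset_index(), res1, by='id', on='date',
                    tolerance=pd.Timedelta(30, unit='d'),
                    direction='backward')

# Restore the original row order and drop the helper column
out = out.sort_values('index').drop(columns='index').reset_index(drop=True)
```

Note that res1 keeps only the non-null rows, so each row in res can only pick up a real observation, never a previously filled value.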

Upvotes: 3
