Duy Bui
Duy Bui

Reputation: 1396

Pandas Rolling window with filtering condition to remove the some latest data

This is a follow-up question of this. I would like to perform a rolling window of the last n days but I want to filter out the latest x days from each window (x is smaller than n)

Here is an example:

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}

df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
   Name       Date  Earning
0  Jack 2021-01-08       40
1   Jim 2021-01-27       10
2  Jack 2021-02-05       20
3   Jim 2021-02-10       20
4  Jack 2021-02-17       40
5  Jack 2021-02-18       50
6   Jim 2021-02-20      100
7  Jack 2021-02-21       70
8  Jane 2021-02-22       80
9  Jane 2021-03-29       90

I would like to

Expected outcome: (The two columns Window_From and Window_To are not needed. I only use them to demonstrate the mock data)

   Name       Date  Earning Window_From  Window_To   Sum
0  Jack 2021-01-08       40  2020-12-09 2020-12-19   0.0
1   Jim 2021-01-27       10  2020-12-28 2021-01-07   0.0
2  Jack 2021-02-05       20  2021-01-06 2021-01-16  40.0
3   Jim 2021-02-10       20  2021-01-11 2021-01-21   0.0
4  Jack 2021-02-17       40  2021-01-18 2021-01-28   0.0
5  Jack 2021-02-18       50  2021-01-19 2021-01-29   0.0
6   Jim 2021-02-20      100  2021-01-21 2021-01-31  10.0
7  Jack 2021-02-21       70  2021-01-22 2021-02-01   0.0
8  Jane 2021-02-22       80  2021-01-23 2021-02-02   0.0
9  Jane 2021-03-29       90  2021-02-27 2021-03-09   0.0

Upvotes: 2

Views: 2444

Answers (2)

user2246849
user2246849

Reputation: 4407

An alternative to rolling (may be faster):

EDIT: actually slower with OP's dataset.

df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] + pd.Timedelta(days=10) 
df = df.set_index(['Name', 'Date'])
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum() 
             for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]

print(df.reset_index().drop(columns=['start', 'end']))
   Name       Date  Earning  Sum
0  Jack 2021-01-08       40    0
1   Jim 2021-01-27       10    0
2  Jack 2021-02-05       20   40
3   Jim 2021-02-10       20    0
4  Jack 2021-02-17       40    0
5  Jack 2021-02-18       50    0
6   Jim 2021-02-20      100   10
7  Jack 2021-02-21       70    0
8  Jane 2021-02-22       80    0
9  Jane 2021-03-29       90    0

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71689

Easy solution

Calculate 30 days and 20 days rolling sum then subtract 30 day sum from 20 day sum to get the effective rolling sum for first 10 days

s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()

df.merge(s1.sub(s2).reset_index(name='sum'), how='left')

   Name       Date  Earning   sum
0  Jack 2021-01-08       40   0.0
1   Jim 2021-01-27       10   0.0
2  Jack 2021-02-05       20  40.0
3   Jim 2021-02-10       20   0.0
4  Jack 2021-02-17       40   0.0
5  Jack 2021-02-18       50   0.0
6   Jim 2021-02-20      100  10.0
7  Jack 2021-02-21       70   0.0
8  Jane 2021-02-22       80   0.0
9  Jane 2021-03-29       90   0.0

Upvotes: 4

Related Questions