Reputation: 341
I've had a look and can't seem to find a solution to this issue. I'm wanting to calculate the rolling sum of the previous 30 days' worth of data at each date in the dataframe - by subgroup - for a set of data that isn't daily - it's spaced fairly irregularly. I've been attempting to use ChatGPT which is getting in a twist over it.
Initially the suggestion was that I'd not converted the Date column to datetime format to allow for the rolling calculation, but now from the code below:
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
# Create a dataset with irregularly spaced dates spanning two years
np.random.seed(42)
date_rng = pd.date_range(start='2022-01-01', end='2023-12-31', freq='10D') # Every 10 days
data = {'Date': np.random.choice(date_rng, size=30),
'Group': np.random.choice(['A', 'B'], size=30),
'Value': np.random.randint(1, 30, size=30)}
df = pd.DataFrame(data)
# Sort DataFrame by date
df.sort_values(by='Date', inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
# Calculate cumulative sum by group within the previous 30 days from each day
df['RollingSum_Last30Days'] = df.groupby('Group')['Value'].transform(lambda x: x.rolling(window='30D', min_periods=1).sum())
I'm getting an error of:
ValueError: window must be an integer 0 or greater
I've found conflicting comments online as to whether the format '30D' works in rolling windows but I'm none the wiser as to a solution to this. Any help appreciated.
Running in VSCode in Python 3.11.8.
Upvotes: 2
Views: 547
Reputation: 2152
Instead of using lambda, Create a function. Using a function is more readable.
import pandas as pd
data = {
'Date': [
'2024-01-01', '2024-01-05', '2024-01-10', '2024-02-01',
'2024-02-15', '2024-03-01', '2024-03-10', '2024-04-01',
'2024-04-10', '2024-05-01', '2024-05-15', '2024-06-01'
],
'Group': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 30, 25, 10, 35, 20, 5, 40, 30, 25]
}
df = pd.DataFrame(data)
# Convert 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Sort DataFrame by date
df.sort_values(by='Date', inplace=True)
# Set Date column as index
df.set_index('Date', inplace=True)
def calculate_rolling_sum(gr):
aa = gr.rolling(window='30D',min_periods= 1)['Value'].sum()
return aa
rolling_sums = df.groupby('Group',group_keys=False)\
.apply(calculate_rolling_sum,include_groups= False)
df['RollingSum_Last30Days'] = rolling_sums
print(df)
'''
Group Value RollingSum_Last30Days
Date
2024-01-01 A 10 10.0
2024-01-05 A 20 30.0
2024-01-10 B 15 15.0
2024-02-01 B 30 45.0
2024-02-15 A 25 25.0
2024-03-01 B 10 40.0
2024-03-10 A 35 60.0
2024-04-01 B 20 20.0
2024-04-10 A 5 5.0
2024-05-01 B 40 40.0
2024-05-15 A 30 30.0
2024-06-01 B 25 25.0
'''
Upvotes: 0
Reputation: 262214
The issue if that you need to specify which column to use as Date but don't have access to the Date with groupby.transform
.
You could use groupby.apply
:
# Calculate cumulative sum by group within the previous 30 days from each day
df['RollingSum_Last30Days'] = (df.groupby('Group', group_keys=False)
.apply(lambda x: x.rolling(window='30D', on='Date', min_periods=1)['Value'].sum())
)
Output:
Date Group Value RollingSum_Last30Days
9 2022-01-11 A 22 22.0
12 2022-01-11 A 22 44.0
6 2022-01-21 A 4 48.0
1 2022-05-21 B 14 14.0
23 2022-05-21 A 8 8.0
15 2022-07-20 B 26 26.0
4 2022-07-20 A 18 18.0
18 2022-07-30 B 10 36.0
7 2022-07-30 A 2 20.0
5 2022-08-19 A 8 10.0
10 2022-10-18 B 10 10.0
16 2022-11-17 B 12 12.0
11 2023-01-06 B 4 4.0
21 2023-02-15 B 16 16.0
26 2023-04-06 B 28 28.0
19 2023-04-26 A 4 4.0
28 2023-05-16 B 8 8.0
0 2023-05-26 B 3 11.0
8 2023-06-05 A 6 6.0
29 2023-06-25 A 21 27.0
17 2023-07-25 A 2 2.0
20 2023-08-04 B 14 14.0
22 2023-08-14 B 15 29.0
14 2023-08-14 B 18 47.0
3 2023-08-24 A 4 4.0
24 2023-09-03 B 14 47.0
25 2023-09-03 A 23 27.0
27 2023-09-03 A 25 52.0
13 2023-09-23 B 29 43.0
2 2023-12-12 A 17 17.0
Upvotes: 1