Reputation: 2644
Note: I already know the answer to this question, i am posting it just because i couldn't find proper answer on stack overflow and it took me surprising amount of time to figure it out. That being said, feel free to suggest other options.
The Question
I have a pandas DataFrame with three columns, one keeps a track of date, one keeps track of related observations (i.e. my grouping column) and third variable stores some numeric values. For each group in my dataframe, i'd like to calculate a rolling sum over the date column. The caveat: some dates are missing in the dataframe, i'd like to treat those as observations with value 0. I do not want to use cross join to add all the dates.
Reproducible example
Let's have a dataframe like this:
df = pd.DataFrame({'id_col' : np.array([1,1,1,1,1,1,2,2,2,2,2,3,3,3]),
'value_col' : np.random.randint(0,5,size=14),
'dates' : pd.to_datetime(np.array([
'2018-01-01',
'2018-02-01',
'2018-03-01',
'2018-05-01',
'2018-06-01',
'2018-09-01',
'2018-01-01',
'2018-02-01',
'2018-05-01',
'2018-06-01',
'2018-07-01',
'2018-01-01',
'2018-02-01',
'2018-03-01'])
)}
)
The data looks something like this:
id_col value_col dates
0 1 0 2018-01-01
1 1 1 2018-02-01
2 1 4 2018-03-01
3 1 0 2018-05-01
4 1 3 2018-06-01
5 1 3 2018-09-01
6 2 4 2018-01-01
7 2 3 2018-02-01
8 2 2 2018-05-01
9 2 0 2018-06-01
10 2 2 2018-07-01
11 3 4 2018-01-01
12 3 2 2018-02-01
13 3 3 2018-03-01
Approaches that i tried and do not work:
Option 1: completely ignores missing dates
df.groupby(['id_col']).rolling(2)['value_col'].sum().reset_index()
Option 2: Based on pandas rolling documentation, replace width of window using the pandas offset parameters (returns ValueError: window must be an integer
). This approach would work if the dates column was used as index of a dataframe. Unfortunately, we can't use simple index here as 2 different IDs from id_col can contain the same date. (We could create MultiIndex, but then get the same value error).
df.groupby(['id_col']).rolling('60d')['value_col'].sum().reset_index()
What does work but isn't very straightforward:
Option 1: Cross join to fill in all missing dates (might be tough if you have a lot of data)
Option 2: Build multiindex from cartesian product of iterables as in this answer. This is actually quite similar to the above option.
Upvotes: 2
Views: 1045
Reputation: 2644
Use the on
parameter of rolling
. The documentation actually mentions it, although there is no example to see the appropriate usage. Luckily, there is pandas github and this issue, which, if you go through the comments, gives some insight on how to actually use the rolling function with offset windows properly.
Therefore, the solution would be:
df.groupby(['id_col']).rolling('60d', on = 'dates')['value_col'].sum().reset_index()
Note the ussage of 60d
as a proxy for 2 months instead of 2m
this is because 2m
would give you the following error: ValueError: <2 * MonthEnds> is a non-fixed frequency
. For more on this issue, check the stackoverflow question here.
Upvotes: 4