ira
ira

Reputation: 2644

Python pandas - Efficiently apply function over rolling window by group with missing dates

Note: I already know the answer to this question, i am posting it just because i couldn't find proper answer on stack overflow and it took me surprising amount of time to figure it out. That being said, feel free to suggest other options.

The Question

I have a pandas DataFrame with three columns, one keeps a track of date, one keeps track of related observations (i.e. my grouping column) and third variable stores some numeric values. For each group in my dataframe, i'd like to calculate a rolling sum over the date column. The caveat: some dates are missing in the dataframe, i'd like to treat those as observations with value 0. I do not want to use cross join to add all the dates.

Reproducible example

Let's have a dataframe like this:

df = pd.DataFrame({'id_col' : np.array([1,1,1,1,1,1,2,2,2,2,2,3,3,3]),
                  'value_col' : np.random.randint(0,5,size=14),
                  'dates' : pd.to_datetime(np.array([
                  '2018-01-01',
                  '2018-02-01',
                  '2018-03-01',
                  '2018-05-01',
                  '2018-06-01',
                  '2018-09-01',
                  '2018-01-01',
                  '2018-02-01',
                  '2018-05-01',
                  '2018-06-01',
                  '2018-07-01',
                  '2018-01-01',
                  '2018-02-01',
                  '2018-03-01'])
                  )}
                 )

The data looks something like this:

    id_col  value_col   dates
0   1   0   2018-01-01
1   1   1   2018-02-01
2   1   4   2018-03-01
3   1   0   2018-05-01
4   1   3   2018-06-01
5   1   3   2018-09-01
6   2   4   2018-01-01
7   2   3   2018-02-01
8   2   2   2018-05-01
9   2   0   2018-06-01
10  2   2   2018-07-01
11  3   4   2018-01-01
12  3   2   2018-02-01
13  3   3   2018-03-01

Approaches that i tried and do not work:

Option 1: completely ignores missing dates

df.groupby(['id_col']).rolling(2)['value_col'].sum().reset_index()

Option 2: Based on pandas rolling documentation, replace width of window using the pandas offset parameters (returns ValueError: window must be an integer). This approach would work if the dates column was used as index of a dataframe. Unfortunately, we can't use simple index here as 2 different IDs from id_col can contain the same date. (We could create MultiIndex, but then get the same value error).

df.groupby(['id_col']).rolling('60d')['value_col'].sum().reset_index()

What does work but isn't very straightforward:

Option 1: Cross join to fill in all missing dates (might be tough if you have a lot of data)

Option 2: Build multiindex from cartesian product of iterables as in this answer. This is actually quite similar to the above option.

Upvotes: 2

Views: 1045

Answers (1)

ira
ira

Reputation: 2644

Use the on parameter of rolling. The documentation actually mentions it, although there is no example to see the appropriate usage. Luckily, there is pandas github and this issue, which, if you go through the comments, gives some insight on how to actually use the rolling function with offset windows properly.

Therefore, the solution would be:

df.groupby(['id_col']).rolling('60d', on = 'dates')['value_col'].sum().reset_index()

Note the ussage of 60d as a proxy for 2 months instead of 2m this is because 2m would give you the following error: ValueError: <2 * MonthEnds> is a non-fixed frequency. For more on this issue, check the stackoverflow question here.

Upvotes: 4

Related Questions