Reputation: 341
I am trying to make groups of x days within groups of another column. For some reason the grouping behavior is changed when I add another level of grouping.
See toy example below:
Create a random dataframe with 40 consecutive dates, an ID column and random values:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'dates':pd.date_range('2018-1-1',periods=40,freq='D'),
'id': np.concatenate((np.repeat(1,10),np.repeat(2,30))),
'amount':np.random.random(40)
}
)
I want to group by id
first and then make groups of let's say 7 consecutive days within these groups. I do:
(df
.groupby(['id',pd.Grouper(key='dates',freq='7D')])
.amount
.agg(['mean','count'])
)
And the output is:
mean count
id dates
1 2018-01-01 0.591755 7
2018-01-08 0.701657 3
2 2018-01-08 0.235837 4
2018-01-15 0.650085 7
2018-01-22 0.463854 7
2018-01-29 0.643556 7
2018-02-05 0.459864 5
There is something weird going on in the second group! I would expect to see 4 groups of 7 and then a last group of 2. When I run the same code on a dataframe with just the id=2
I do get what I actually expect:
df2=df[df.id==2]
(df2
.groupby(['id',pd.Grouper(key='dates',freq='7D')])
.amount
.agg(['mean','count'])
)
Output
mean count
id dates
2 2018-01-11 0.389343 7
2018-01-18 0.672550 7
2018-01-25 0.486620 7
2018-02-01 0.520816 7
2018-02-08 0.529915 2
What is going on here? Is it first creating a group of 4 in the id=2
group because the last group in id=1
group was only 3 rows? This is not what I want to do!
Upvotes: 3
Views: 824
Reputation: 402673
When you group with both IDs, you have a spillover from the first group into the second when you perform a weekly groupby (because there are not enough days in the last week to complete a full 7 days in group #1). This is obvious when you look at the first date per group:
"2018-01-08" in the first case v/s "2018-01-11".
The workaround is to perform a groupby
on id
and then apply
a resampling operation:
df.groupby('id').apply(
lambda x: x.set_index('dates').amount.resample('7D').count()
)
id dates
1 2018-01-01 7
2018-01-08 3
2 2018-01-11 7
2018-01-18 7
2018-01-25 7
2018-02-01 7
2018-02-08 2
Name: amount, dtype: int64
Upvotes: 2