user4505419

Reputation: 341

Weird behavior with the pandas Grouper method on datetime objects

I am trying to make groups of x days within groups defined by another column. For some reason the grouping behavior changes when I add another level of grouping.

See toy example below:

Create a random dataframe with 40 consecutive dates, an ID column and random values:

import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'dates': pd.date_range('2018-1-1', periods=40, freq='D'),
     'id': np.concatenate((np.repeat(1, 10), np.repeat(2, 30))),
     'amount': np.random.random(40)}
)

I want to group by id first and then make groups of, say, 7 consecutive days within those groups. I do:

(df
 .groupby(['id',pd.Grouper(key='dates',freq='7D')])
 .amount
 .agg(['mean','count'])
)

And the output is:

                   mean  count
id dates                      
1  2018-01-01  0.591755      7
   2018-01-08  0.701657      3
2  2018-01-08  0.235837      4
   2018-01-15  0.650085      7
   2018-01-22  0.463854      7
   2018-01-29  0.643556      7
   2018-02-05  0.459864      5

There is something weird going on in the second group! I would expect to see 4 groups of 7 and then a last group of 2. When I run the same code on a dataframe containing only id=2, I do get what I expect:

df2 = df[df.id == 2]

(df2
 .groupby(['id',pd.Grouper(key='dates',freq='7D')])
 .amount
 .agg(['mean','count'])
)

Output

                   mean  count
id dates                      
2  2018-01-11  0.389343      7
   2018-01-18  0.672550      7
   2018-01-25  0.486620      7
   2018-02-01  0.520816      7
   2018-02-08  0.529915      2

What is going on here? Is it creating a first group of 4 within id=2 because the last group within id=1 had only 3 rows? This is not what I want!

Upvotes: 3

Views: 824

Answers (1)

cs95

Reputation: 402673

When you group with both IDs, you get spillover from the first group into the second when you perform the weekly groupby: the last 7-day bin of group #1 is not full, and the Grouper lays its bin edges over the whole dates column rather than restarting them for each id. This is obvious when you look at the first date per group:

"2018-01-08" in the first case vs. "2018-01-11" in the second.

The workaround is to perform a groupby on id and then apply a resampling operation:

# resample computes the 7-day bins within each id group,
# starting from that group's first date
df.groupby('id').apply(
    lambda x: x.set_index('dates').amount.resample('7D').count()
)

id  dates     
1   2018-01-01    7
    2018-01-08    3
2   2018-01-11    7
    2018-01-18    7
    2018-01-25    7
    2018-02-01    7
    2018-02-08    2
Name: amount, dtype: int64
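
If you also need the mean, as in the question's agg(['mean','count']), the same pattern works; a minimal sketch along those lines:

# Same workaround, but keeping both statistics so the result mirrors
# the question's .agg(['mean', 'count']) call.
df.groupby('id').apply(
    lambda x: x.set_index('dates').amount.resample('7D').agg(['mean', 'count'])
)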

Upvotes: 2
