FooBar
FooBar

Reputation: 16488

Pandas TimeGrouper: Drop "non full groups"

I'm grouping my data on some frequency, but it appears that TimeGrouper creates a last group on the right for some "left over" data.

df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping'].plot()

I expect the data to be fairly constant over time, but the last data point at 2013 drops by almost half. I expect this to happen because with biannual grouping, the second half (2014) is missing.

rolling_mean allows center=True, which will will put NaN/drop remainders on the left and right. Is there a similar feature for the Grouper? I couldn't find any on the manual, but perhaps there is a workaround?

left over

Upvotes: 2

Views: 599

Answers (1)

JohnE
JohnE

Reputation: 30424

I don't think the issue here really concerns options available with TimeGrouper, but rather, how you want to deal with uneven data. You basically have 4 options that I can think of:

1) Drop enough observations (at the start or end) such that you have a multiple of 2 years worth of observations.

2) Extrapolate your starting (or ending) period such that it is comparable to the periods with complete data.

3) Normalize your data to 2 year sums based on underlying time periods of less than 2 years. This approach could be combined with the other two.

4) Instead of a groupby sort of approach, just do a rolling_sum.


Example dataframe:

rng = pd.date_range('1/1/2010', periods=60, freq='1m')
df = pd.DataFrame({ 'shopping' : np.random.choice(12,60) }, index=rng )

I just made the example data set with 5 years of data starting on Jan 1, so if you did this on an annual basis, you'd be done.

df.groupby([pd.TimeGrouper("AS", label='left')]).sum()['shopping']
Out[206]: 
2010-01-01    78
2011-01-01    60
2012-01-01    76
2013-01-01    51
2014-01-01    60
Freq: AS-JAN, Name: shopping, dtype: int64

Here's your problem in table form, with the first 2 groups based on 2 years of data but the third group based on only 1 year of data.

df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping']
Out[205]: 
2010-01-01    138
2012-01-01    127
2014-01-01     60
Freq: 2AS-JAN, Name: shopping, dtype: int64

If you take approach (1) above, you just need to drop some observations. It's very easy to drop the later observations and retype the same command. It's a little trickier to drop the earlier observations because then your first observation doesn't begin on Jan 1 of an even year and you lose the automatic labelling and such. Here's an approach that will drop the first year and keep the last 4, but you lose the nice labelling (you can compare with annual data above to verify that this is correct):

In [202]: df2 = df[12:]

In [203]: df2['group24'] = (np.arange( len(df2) ) / 24 ).astype(int)

In [204]: df2.groupby('group24').sum()['shopping']
Out[204]: 
group24
0          136
1          111

Alternatively, let's try approach (2), extrapolating. To do this, just replace sum() with mean() and multiply by 24. For the last period, this just means we assume that the 60 in 2014 will be equaled by another 60 in 2015. Whether or not this is reasonable will be a judgement call for you to make and you'd probably want to label with an asterisk and call it an estimate.

df.groupby([pd.TimeGrouper("2AS")]).mean()['shopping']*24
Out[208]: 
2010-01-01    138
2012-01-01    127
2014-01-01    120
Freq: 2AS-JAN, Name: shopping, dtype: float64

Also keep in mind this is just one simple (probably simplistic) way you could extrapolate at the end of the period. Whether this is the best way to do it (or whether it makes sense to extrapolate at all) is a judgement call for you to make depending on the situation.

Next, you could take approach (3) and do some sort of normalization. I'm not sure exactly what you want, so I'll just sketch the ideas. If you want to display two year sums, you could just use the earlier example of replacing "2AS" with "AS" and then multiply by 2. This basically makes the table look wrong, but would be a really simple way to make the graph look OK.

Finally, just use rolling sum:

pd.rolling_sum(df.shopping,window=24)

Doesn't table well, but would plot well.

Upvotes: 2

Related Questions