Reputation: 209
What's the best way in Pandas to resample/group/etc by year, but instead of going by calendar years, calculate full years starting with the last date in the data?
Example data set
import pandas as pd

df = pd.DataFrame({
    'MyDate': ['2017-02-01', '2017-07-05', '2017-08-26', '2017-09-03', '2018-02-04',
               '2018-08-03', '2018-08-10', '2018-12-03', '2019-07-13', '2019-08-15'],
    'MyValue': [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
})
       MyDate  MyValue
0  2017-02-01      100
1  2017-07-05       90
2  2017-08-26       80
3  2017-09-03       70
4  2018-02-04       60
5  2018-08-03       50
6  2018-08-10       40
7  2018-12-03       30
8  2019-07-13       20
9  2019-08-15       10
Example result
The last date is 2019-08-15, so I'd like to group by the last full year, 2018-08-16 - 2019-08-15, then 2017-08-16 - 2018-08-15, etc.
Here is the last row of each such year:
       MyDate  MyValue
0  2017-07-05       90
1  2018-08-10       40
2  2019-08-15       10
Upvotes: 3
Views: 1460
Reputation: 476584
You first need to parse your dates to real date objects, like:
df['MyDate'] = pd.to_datetime(df['MyDate'])
Next, we can group by the number of whole years between each date and the last date, computed with relativedelta from the python-dateutil package:
>>> from operator import attrgetter
>>> from dateutil.relativedelta import relativedelta
>>> df.groupby(df['MyDate'].apply(relativedelta, dt2=df['MyDate'].max()).apply(attrgetter('years'))).last()
           MyDate  MyValue
MyDate
-2     2017-07-05       90
-1     2018-08-10       40
 0     2019-08-15       10
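For anyone who prefers a script over the REPL one-liner, here is a minimal sketch of the same idea, with the question's data inlined. The names last_date, years_back and result are only illustrative, and the lambda replaces the attrgetter chain purely for readability.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

df = pd.DataFrame({
    'MyDate': pd.to_datetime([
        '2017-02-01', '2017-07-05', '2017-08-26', '2017-09-03', '2018-02-04',
        '2018-08-03', '2018-08-10', '2018-12-03', '2019-07-13', '2019-08-15']),
    'MyValue': [100, 90, 80, 70, 60, 50, 40, 30, 20, 10],
})

last_date = df['MyDate'].max()

# Whole years from the last date to each row's date: 0 for the most
# recent full year, -1 for the one before it, and so on.
years_back = df['MyDate'].apply(
    lambda d: relativedelta(d, last_date).years
).rename('years_back')

# Last row of each full-year window, oldest window first.
result = df.groupby(years_back).last().reset_index(drop=True)
print(result)
```

Because relativedelta counts calendar years, a date exactly one year before the last date (2018-08-15 here) lands in the previous window, which matches the windows described in the question.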
Upvotes: 3
Reputation: 862591
You can subtract the last date from each date, convert the differences to whole-year groups, and pass the result to groupby with GroupBy.last:
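df['MyDate'] = pd.to_datetime(df['MyDate'])
# Years back from the last row's date (assumes df is sorted by MyDate);
# 365.25 days approximates one year, and astype(int) truncates toward zero
s = (df['MyDate'].sub(df['MyDate'].iat[-1]).dt.days / 365.25).astype(int)
df = df.groupby(s).last().reset_index(drop=True)
print(df)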
      MyDate  MyValue
0 2017-07-05       90
1 2018-08-10       40
2 2019-08-15       10
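One caveat worth checking against your own data: dividing by 365.25 only approximates a year, so dates right at a window boundary can end up in a different group than a calendar-aware count would give. The sketch below compares the two on a pair of hypothetical boundary dates (they are not in the question's data); relativedelta from python-dateutil is used only as the calendar-aware reference.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

last_date = pd.Timestamp('2019-08-15')

# Hypothetical boundary dates, chosen for illustration only.
dates = pd.Series(pd.to_datetime(['2018-08-15', '2018-08-16']))

# Approximate grouping: day difference divided by 365.25, truncated toward zero.
approx = (dates.sub(last_date).dt.days / 365.25).astype(int)

# Calendar-aware grouping: whole years between each date and the last date.
exact = dates.apply(lambda d: relativedelta(d, last_date).years)

print(pd.DataFrame({'date': dates, 'approx': approx, 'exact': exact}))
```

Here 2018-08-15 is 365 days before the last date, so the approximation puts it in group 0, while the calendar-aware count puts it in group -1 (the question's 2017-08-16 - 2018-08-15 window). For the sample data in the question both approaches give the same result.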
Upvotes: 3
Reputation: 59529
One way is to use pd.cut, specifying the bins with pd.offsets.DateOffset so that the bin edges are exactly one calendar year apart, counting back from the last date.
import numpy as np
import pandas as pd
df['MyDate'] = pd.to_datetime(df['MyDate'])
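# Number of bin edges; dividing days by 365.25 avoids the ambiguous 'Y' timedelta unit, which newer pandas versions may reject
N = int(np.ceil((df.MyDate.max() - df.MyDate.min()).days / 365.25)) + 1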
bins = [df.MyDate.max()-pd.offsets.DateOffset(years=y) for y in range(N)][::-1]
df.groupby(pd.cut(df.MyDate, bins)).last()
# MyDate MyValue
#MyDate
#(2016-08-15, 2017-08-15] 2017-07-05 90
#(2017-08-15, 2018-08-15] 2018-08-10 40
#(2018-08-15, 2019-08-15] 2019-08-15 10
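As a variation on the same pd.cut idea, the sketch below sizes the bins from the difference in calendar years rather than from a timedelta division, and passes observed=True so windows without rows are dropped. The names last_date, n_years, edges and windows are illustrative, and the data is the question's sample inlined so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'MyDate': pd.to_datetime([
        '2017-02-01', '2017-07-05', '2017-08-26', '2017-09-03', '2018-02-04',
        '2018-08-03', '2018-08-10', '2018-12-03', '2019-07-13', '2019-08-15']),
    'MyValue': [100, 90, 80, 70, 60, 50, 40, 30, 20, 10],
})

last_date = df['MyDate'].max()

# Upper bound on the number of yearly windows needed to cover the data.
n_years = last_date.year - df['MyDate'].min().year + 1

# Bin edges one calendar year apart, counting back from the last date.
edges = sorted(last_date - pd.DateOffset(years=y) for y in range(n_years + 1))

# pd.cut bins are right-closed by default, so each window ends on the
# anniversary of the last date and starts the day after the previous one.
windows = pd.cut(df['MyDate'], bins=edges)

result = df.groupby(windows, observed=True).last()
print(result)
```

The right-closed intervals mean a date such as 2018-08-15 belongs to (2017-08-15, 2018-08-15], which lines up with the windows described in the question.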
Upvotes: 2