Reputation: 35
I have a dataframe with data for 20 years with the time as datatime index.
EDIT
Time value
1999-01-01 00:00:00 7
1999-01-01 01:00:00 4
1999-01-01 02:00:00 9
1999-01-01 03:00:00 4
1999-01-01 04:00:00 2
...
2018-12-31 19:00:00 8
2018-12-31 20:00:00 1
2018-12-31 21:00:00 9
2018-12-31 22:00:00 6
2018-12-31 23:00:00 5
I needed to get the sum for every month, which I did like
df_new = df.groupby([df.index.year, df.index.month]).sum()
And the result
Month value
1999 1 6
2 9
3 7
2000 1 5
2 7
3 6
2001 1 4
2 6
3 8
2002 1 7
2 9
3 8
2003 1 5
2 7
3 7
....
2018 1 9
2 6
3 7
But now I need a way to calculate the average for the past 3 years for every month. For example for 2002 I will have: for month 1 the average of month 1 of 1999,2000,2001, for month 2 the average of month 2 of 1999,2000,2001 and so on. Then for 2003 the average will be for 2000, 2001 and 2002 and like that until 2018. Meaning the first 3 years I will get Nan values.
So my final output should look like this:
Month value average_past_3_years
1999 1 6 nan
2 9 nan
3 7 nan
2000 1 5 nan
2 7 nan
3 6 nan
2001 1 4 nan
2 6 nan
3 8 nan
2002 1 7 5.0
2 9 7.3
3 8 7.0
2003 1 5 5.3
2 7 7.3
3 7 7.3
...
I was thinking to use .apply() but can not figure it out a function to make it work :(
df.groupby([df.index.year, df.index.month]).apply(somefunction)
Upvotes: 1
Views: 3498
Reputation: 2757
Groupby can certainly do the trick. Here is another approach using stack
and unstack
to achieve vectorization,
(df.set_index(['Year', 'Month'])['value'] # set up indexed-series
.unstack('Month') # reshape into matrix
.rolling(3) # rolling mean, across all months
.mean()
.stack(dropna=False)) # Reshape back
Upvotes: 0
Reputation: 149185
I could not guess what were the columns and indexes in your dataframe. So assuming that it is:
df = pd.DataFrame({'year': [1999.0, 1999.0, 1999.0, 2000.0, 2000.0, 2000.0,
2001.0, 2001.0, 2001.0, 2002.0, 2002.0, 2002.0,
2003.0, 2003.0, 2003.0],
'Month': ['1', '2', '3', '1', '2', '3', '1', '2', '3',
'1', '2', '3', '1', '2', '3'],
'value': ['6', '9', '7', '5', '7', '6', '4', '6', '8',
'7', '9', '8', '5', '7', '7']})
giving:
0 year Month value
1 1999 1 6
2 1999 2 9
3 1999 3 7
4 2000 1 5
5 2000 2 7
6 2000 3 6
7 2001 1 4
8 2001 2 6
9 2001 3 8
10 2002 1 7
11 2002 2 9
12 2002 3 8
13 2003 1 5
14 2003 2 7
15 2003 3 7
You can group by month and use a rolling windows of size 3 to compute the rolling sum of the last 3 years per month, and shift the result to align it:
df['average_past_3_years'] = df.groupby('Month').rolling(3).agg(
{'value':'mean', 'year': 'max'}).reset_index(level=0).groupby(
'Month').transform('shift')['value']
It will give as expected:
0 year Month value average_past_3_years
1 1999 1 6 NaN
2 1999 2 9 NaN
3 1999 3 7 NaN
4 2000 1 5 NaN
5 2000 2 7 NaN
6 2000 3 6 NaN
7 2001 1 4 NaN
8 2001 2 6 NaN
9 2001 3 8 NaN
10 2002 1 7 5.000000
11 2002 2 9 7.333333
12 2002 3 8 7.000000
13 2003 1 5 5.333333
14 2003 2 7 7.333333
15 2003 3 7 7.333333
Upvotes: 1