mar_p
mar_p

Reputation: 35

Pandas, how to calculate mean values of the past n years for every month

I have a dataframe with data for 20 years with the time as datatime index.

EDIT

Time value
1999-01-01 00:00:00 7 1999-01-01 01:00:00 4 1999-01-01 02:00:00 9 1999-01-01 03:00:00 4 1999-01-01 04:00:00 2 ... 2018-12-31 19:00:00 8 2018-12-31 20:00:00 1 2018-12-31 21:00:00 9 2018-12-31 22:00:00 6 2018-12-31 23:00:00 5

I needed to get the sum for every month, which I did like

df_new = df.groupby([df.index.year, df.index.month]).sum()

And the result

     Month  value
1999    1   6
        2   9
        3   7
2000    1   5
        2   7
        3   6
2001    1   4
        2   6
        3   8
2002    1   7
        2   9
        3   8
2003    1   5
        2   7
        3   7
        ....
2018    1   9
        2   6
        3   7

But now I need a way to calculate the average for the past 3 years for every month. For example for 2002 I will have: for month 1 the average of month 1 of 1999,2000,2001, for month 2 the average of month 2 of 1999,2000,2001 and so on. Then for 2003 the average will be for 2000, 2001 and 2002 and like that until 2018. Meaning the first 3 years I will get Nan values.

So my final output should look like this:

   Month    value   average_past_3_years
1999    1   6   nan
        2   9   nan
        3   7   nan
2000    1   5   nan
        2   7   nan
        3   6   nan
2001    1   4   nan
        2   6   nan
        3   8   nan
2002    1   7   5.0
        2   9   7.3
        3   8   7.0
2003    1   5   5.3
        2   7   7.3
        3   7   7.3
            ...

I was thinking to use .apply() but can not figure it out a function to make it work :(

df.groupby([df.index.year, df.index.month]).apply(somefunction)

Upvotes: 1

Views: 3498

Answers (2)

Mark Wang
Mark Wang

Reputation: 2757

Groupby can certainly do the trick. Here is another approach using stack and unstack to achieve vectorization,

(df.set_index(['Year', 'Month'])['value'] # set up indexed-series
   .unstack('Month')                        # reshape into matrix
   .rolling(3)                 # rolling mean, across all months
   .mean() 
   .stack(dropna=False))      # Reshape back 

Upvotes: 0

Serge Ballesta
Serge Ballesta

Reputation: 149185

I could not guess what were the columns and indexes in your dataframe. So assuming that it is:

df = pd.DataFrame({'year': [1999.0, 1999.0, 1999.0, 2000.0, 2000.0, 2000.0,
                            2001.0, 2001.0, 2001.0, 2002.0, 2002.0, 2002.0,
                            2003.0, 2003.0, 2003.0],
                   'Month': ['1', '2', '3', '1', '2', '3', '1', '2', '3',
                             '1', '2', '3', '1', '2', '3'],
                   'value': ['6', '9', '7', '5', '7', '6', '4', '6', '8',
                             '7', '9', '8', '5', '7', '7']})

giving:

0   year Month value
1   1999     1     6
2   1999     2     9
3   1999     3     7
4   2000     1     5
5   2000     2     7
6   2000     3     6
7   2001     1     4
8   2001     2     6
9   2001     3     8
10  2002     1     7
11  2002     2     9
12  2002     3     8
13  2003     1     5
14  2003     2     7
15  2003     3     7

You can group by month and use a rolling windows of size 3 to compute the rolling sum of the last 3 years per month, and shift the result to align it:

df['average_past_3_years'] = df.groupby('Month').rolling(3).agg(
                      {'value':'mean', 'year': 'max'}).reset_index(level=0).groupby(
                      'Month').transform('shift')['value']

It will give as expected:

0   year Month value  average_past_3_years
1   1999     1     6                   NaN
2   1999     2     9                   NaN
3   1999     3     7                   NaN
4   2000     1     5                   NaN
5   2000     2     7                   NaN
6   2000     3     6                   NaN
7   2001     1     4                   NaN
8   2001     2     6                   NaN
9   2001     3     8                   NaN
10  2002     1     7              5.000000
11  2002     2     9              7.333333
12  2002     3     8              7.000000
13  2003     1     5              5.333333
14  2003     2     7              7.333333
15  2003     3     7              7.333333

Upvotes: 1

Related Questions