Python Pandas groupby or rolling multi year average summary statistics

Question

I have a pandas time series data frame with approximately 20 rows for for each year, from 2014 to 2017, and I'm trying to calculate the mean value for each two year period. For example: 01/1/2014 ... 31/12/2015, 01/1/2015 ... 31/12/2016, 01/1/2016 ... 31/12/2017

Here is the code I'm using to import the DataFrame:

import pandas as pd

infile = 'https://environment.data.gov.uk/bwq/downloadAPI/requestDownload?report=samples&bw=ukj2100-14950&to=2018-02-05&from=2014-05-01'
df = pd.read_csv(infile,compression='zip',usecols=['intestinalEnterococciCount','sampleTime'], parse_dates=['sampleTime'],infer_datetime_format=True,index_col=['sampleTime'],na_values=True)

and an example of the DataFrame:

                     intestinalEnterococciCount
sampleTime                                     
2014-05-12 13:00:00                          10
2014-05-21 12:27:00                          10
2014-05-27 10:55:00                          10
2014-06-06 12:19:00                          10
2014-06-09 13:26:00                          10

I would like to calculate the mean value for each two year period. The expected answers would be:

Period                Mean
Jan 2014 - Dec 2015:  33.575
Jan 2015 - Dec 2016:  22.85
Jan 2016 - Dec 2017:  25.5

What I tried:

I know I could use a loop and iterate through a list of the two year periods and calculate it that way, but I’m sure there must be a nicer way to achieve this using Pandas.
I tried using .rolling but that appears to give a rolling mean, which increments forward row by row, rather than over two year periods.
I can successfully use groupby(df.index.year).mean to get the mean for each year, but how would I go about it for calculating it for each two year period?

Peter Leimbigler · Accepted Answer

To get other aggregate statistics like standard deviation and geometric mean, here's a somewhat hackish way:

df_std = pd.DataFrame([df[str(y):str(y+2)].std() for y in df.index.year.unique()])
df_std.index = df.index.year.unique().sort_values()

df_std
            intestinalEnterococciCount
sampleTime
2014                         63.825528
2015                         37.596271
2016                         34.845224
2017                         51.384066

from scipy.stats.mstats import gmean
df_gm = pd.DataFrame([df[str(y):str(y+2)].agg(gmean) for y in df.index.year.unique()])
df_gm.index = df.index.year.unique().sort_values()

df_gm
            intestinalEnterococciCount
sampleTime
2014                         16.230186
2015                         16.136248
2016                         16.377124
2017                         19.529690

Python Pandas groupby or rolling multi year average summary statistics

Answers (2)

Related Questions