Reputation: 13
I have a pandas time series data frame with approximately 20 rows for for each year, from 2014 to 2017, and I'm trying to calculate the mean value for each two year period. For example: 01/1/2014 ... 31/12/2015, 01/1/2015 ... 31/12/2016, 01/1/2016 ... 31/12/2017
Here is the code I'm using to import the DataFrame:
import pandas as pd
infile = 'https://environment.data.gov.uk/bwq/downloadAPI/requestDownload?report=samples&bw=ukj2100-14950&to=2018-02-05&from=2014-05-01'
df = pd.read_csv(infile,compression='zip',usecols=['intestinalEnterococciCount','sampleTime'], parse_dates=['sampleTime'],infer_datetime_format=True,index_col=['sampleTime'],na_values=True)
and an example of the DataFrame:
intestinalEnterococciCount
sampleTime
2014-05-12 13:00:00 10
2014-05-21 12:27:00 10
2014-05-27 10:55:00 10
2014-06-06 12:19:00 10
2014-06-09 13:26:00 10
I would like to calculate the mean value for each two year period. The expected answers would be:
Period Mean
Jan 2014 - Dec 2015: 33.575
Jan 2015 - Dec 2016: 22.85
Jan 2016 - Dec 2017: 25.5
What I tried:
.rolling
but that appears to give a rolling mean, which increments forward row by row, rather than over two year periods.groupby(df.index.year).mean
to get the mean for each year, but how would I go about it for calculating it for each two year period?Upvotes: 1
Views: 853
Reputation: 323306
You can using groupby
and rolling
, make sure you record count and sum for future calculation of mean ,(you just need make the change of index to what you need by using s.index=[your index list]
)
s=df.groupby(df.index.strftime('%Y')).intestinalEnterococciCount.agg(['sum','count'])
s=s.rolling(window=2).sum()
s['mean']=s['sum']/s['count']
s.dropna()
Out[564]:
sum count mean
2015 1343.0 40.0 33.575
2016 914.0 40.0 22.850
2017 765.0 30.0 25.500
Update:
s=df.groupby(df.index.strftime('%Y')).intestinalEnterococciCount.apply(list)
(s+s.shift()).dropna().apply(pd.Series).stack().std(level=0)
Out[601]:
2015 76.472179
2016 33.701974
2017 34.845224
dtype: float64
Upvotes: 1
Reputation: 11105
To get other aggregate statistics like standard deviation and geometric mean, here's a somewhat hackish way:
df_std = pd.DataFrame([df[str(y):str(y+2)].std() for y in df.index.year.unique()])
df_std.index = df.index.year.unique().sort_values()
df_std
intestinalEnterococciCount
sampleTime
2014 63.825528
2015 37.596271
2016 34.845224
2017 51.384066
from scipy.stats.mstats import gmean
df_gm = pd.DataFrame([df[str(y):str(y+2)].agg(gmean) for y in df.index.year.unique()])
df_gm.index = df.index.year.unique().sort_values()
df_gm
intestinalEnterococciCount
sampleTime
2014 16.230186
2015 16.136248
2016 16.377124
2017 19.529690
Upvotes: 0