Reputation: 177
I have a large data set with data from multiple locations (given in lat/long) over 80 years. I am trying to calculate a 10 year average for both column a and b from each site for the entire span of the time frame. Below is a sample of of the data table.
Lat Long Year Month Day a b
46.90625 -115.46875 1950 01 01 0.0000 1.1335
46.90625 -115.46875 1950 01 02 0.0000 1.1276
46.90625 -115.46875 1950 01 03 0.0000 1.1213
Here is a sample of what I have tried but keep getting lost on.
fname = output1
df = pandas.read_table(output1)
lat_long_group = df.groupby(['Lat','Long','Year']).agg(['mean','count'])
monthly_average = lat_long_group.aggregate({'a':numpy.mean,
'b': numpy.mean})
Upvotes: 3
Views: 2431
Reputation: 109528
First, create a column based on Pandas Timestamps:
df = df.dropna()
df['date'] = df.apply(lambda x: pd.Timestamp('{year}-{month}-{day}'
.format(year=int(x.Year),
month=int(x.Month),
day=int(x.Day))),
axis=1)
Next, set you location based on tuple pairs of Lat and Long.
df['Location'] = zip(df.Lat, df.Long)
Now, delete the redundant data.
df.drop(['Year', 'Month', 'Day', 'Lat', 'Long'], axis=1, inplace=True)
We can now pivot the data by date and location. Your new DataFrame is now indexed on the date:
df2 = df.pivot(index='date', columns='Location')
Swap the levels of the new columns (so that location is on top of the values).
df2.columns = df2.columns.swaplevel('Location', None)
Finally, use resample
to get the mean value of your data over ten year periods:
>>> df2.resample('10A', how='mean') # 'A'=Annual, '10A'=TenYears
Location (46.90625, -115.46875)
a b
date
1950-12-31 0 1.127484
1960-12-31 0 1.127467
1970-12-31 0 1.127467
1980-12-31 0 1.127467
1990-12-31 0 1.127467
2000-12-31 0 1.127467
2010-12-31 0 1.127467
2020-12-31 0 1.127467
2030-12-31 0 1.127467
2040-12-31 0 1.127452
I used identical data for 30k rows (except for the dates, of course), but you can see how the process would work.
Note that the data is broken into even ten year blocks, so you may have stubs in your data at both ends (e.g. if your data started in 1947 then the first period would only be 3-4 years.
Upvotes: 5