Reputation: 828
I looked at pandas and it might not be easy to use for my purpose. Perhaps there is a Python toolkit out there that I'm not aware of. Could you tell me what package could easily handle situations like the one described below?
I have a series of 2D numpy arrays of instantaneous values of a meteorological variable (for a geographical area), sampled at hours 0, 6, 12, ..., 96, ... for a specific date.
I have it as a python dictionary:
values[0]: [[3, 2, ..., 9, 5], [6, 7, ..., 6, 7]]
...
...
values[96]: [[2, 2, ..., 8, 5], [6, 7, ..., 5, 6]]
I need to average over a configurable aggregation step. For example, with an aggregation step of 24 hours I would obtain 4 daily averages:
values_avg24h=average(values, aggr_step=24)
values_avg24h[24]=[[...][...]]
values_avg24h[48]=[[...][...]]
values_avg24h[72]=[[...][...]]
values_avg24h[96]=[[...][...]]
The time resolution can change, and the original values may themselves already be averages over a different aggregation step. An existing package that handles all of this would be ideal, but even a clever solution to this simple problem would be appreciated.
Upvotes: 0
Views: 5407
Reputation: 9676
If the data are sampled at a regular frequency (as your post seems to indicate), then pandas can certainly help you.
One possible pandas solution is to create a Panel object where the items (like the keys in your dictionary) are of class pandas.tseries.index.DatetimeIndex. The major_axis and minor_axis elements of the Panel will then be the rows and columns of your numpy arrays. This sounds a bit strange, so I will show a quick example.
>>> import pandas as pd
>>> import numpy as np
>>> start_date = '4/14/2013' # Start today (default time is midnight)
# create the date_range we will use as panel items
>>> ind = pd.date_range(start=start_date, periods=20, freq='6H')
# Create data: thanks @mtadd
>>> values = {h: np.array([np.random.randint(1, 11, 5) for _ in range(2)])
...           for h in np.r_[0:120:6]}
# Create the Panel object directly from the dict of 2D np.arrays
>>> my_panel = pd.Panel(values)
# Set the Panel's items to be the date_range we made earlier
>>> my_panel.items = ind
>>> my_panel.ix[0] # Show first 2D array of data
0 1 2 3 4
0 4 3 10 6 6
1 4 10 7 10 6
>>> values[0] # Same as above, but from the dict
array([[ 4, 3, 10, 6, 6],
[ 4, 10, 7, 10, 6]])
Now that we have things set up, we will use the Panel method resample. We will supply two arguments: the first is the new frequency we want the data at, and the second is the keyword argument how, which we will set to 'mean'. This is the main part of my answer!
>>> averaged = my_panel.resample('24H', how='mean')
>>> list(averaged.items) # Show new time stamps for data
[<Timestamp: 2013-04-14 00:00:00>,
<Timestamp: 2013-04-15 00:00:00>,
<Timestamp: 2013-04-16 00:00:00>,
<Timestamp: 2013-04-17 00:00:00>,
<Timestamp: 2013-04-18 00:00:00>]
Note that the '24H' parameter we gave the resample method was simply an extension of the example posted by the OP. See this link for more information on the kinds of frequency strings this argument can take.
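As a small aside (a sketch, not tied to the original data): anywhere a frequency string like '24H' is accepted, you can also pass a timedelta-like object, which sidesteps the alias spelling entirely.

```python
import pandas as pd
from datetime import timedelta

# A frequency can be given as a string or as a timedelta-like object;
# these two describe the same 24-hour spacing.
ind_a = pd.date_range('2013-04-14', periods=4, freq=timedelta(hours=24))
ind_b = pd.date_range('2013-04-14', periods=4, freq=pd.Timedelta(hours=24))
print(list(ind_a) == list(ind_b))  # True
```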
To verify that the mean was calculated correctly, we will compare the first item in the averaged Panel to the mean we calculate by hand.
>>> averaged.ix[0]
0 1 2 3 4
0 6.5 5.25 7.25 6.25 6.50
1 5.0 6.75 5.25 6.50 5.25
>>> (values[0] + values[6] + values[12] + values[18]) / 4.
array([[ 6.5 , 5.25, 7.25, 6.25, 6.5 ],
[ 5. , 6.75, 5.25, 6.5 , 5.25]])
You can have much more flexibility over the time stamps for the items if you pass in a datetime.datetime object. For example, if you started sampling on New Year's Day at 1:37 PM and continued sampling for 50 periods, you could do this:
>>> from datetime import datetime
>>> start = datetime(2013, 1, 1, 13, 37) # 1:37 on 1/1/2013
>>> ind = pd.date_range(start, periods=50, freq='6H')
Then you would carry on as before. You can also pass start and end times to date_range instead of the number of periods you want it to generate. If we collected data every 2.5 hours, from the same starting time as above until 4:50 AM on February 28th, you could do the following:
>>> end = datetime(2013, 2, 28, 4, 50) # 4:50 AM on 2/28/2013
>>> ind = pd.date_range(start, end, freq='2H 30MIN')
Notice that you don't have to pass start and end times that align perfectly with the frequency you passed in. For more details on how to work with the Panel object or time series data in pandas, check out the links given in those words.
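For readers on newer pandas: the Panel class has since been removed, but the same resample-and-mean idea works on a DataFrame whose index is a DatetimeIndex, flattening each 2D array into one row. A sketch, assuming the 2x5 array shapes used above (the variable names are mine):

```python
import numpy as np
import pandas as pd

# Sample data shaped like the question: 2x5 arrays every 6 hours, 0..96
rng = np.random.default_rng(0)
values = {h: rng.integers(1, 11, size=(2, 5)) for h in range(0, 102, 6)}

hours = sorted(values)
ind = pd.to_datetime('2013-04-14') + pd.to_timedelta(hours, unit='h')

# One flattened array per row, time stamps as the index
df = pd.DataFrame([values[h].ravel() for h in hours], index=ind)

# Average each 24-hour window, then restore the 2x5 shape
avg = df.resample(pd.Timedelta(hours=24)).mean()
values_avg24h = {ts: row.to_numpy().reshape(2, 5) for ts, row in avg.iterrows()}
```

Note that the averages are labeled by the start of each window (a Timestamp) rather than by the hour at its end, matching resample's default labeling.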
Upvotes: 2
Reputation: 2555
import numpy as np

def average(values, aggr_step=24):
    # Sample hours (the dict keys) as a numpy array
    keys = np.array(sorted(values))
    # Right edges of the aggregation bins: aggr_step, 2*aggr_step, ...
    bins = np.arange(aggr_step, keys.max() + 1, aggr_step)
    # Bin index for each sample hour (a sample exactly at the last edge,
    # e.g. hour 96 here, falls past every bin and is dropped)
    npd = np.digitize(keys, bins)
    # Average the arrays within each bin, labeled by the bin's right edge
    return {b: np.mean([values[k] for k in keys[npd == i]], axis=0)
            for i, b in enumerate(bins)}
Whip up some sample values, and test:
values = {h: np.array([np.random.randint(1, 11, 5) for _ in range(2)])
          for h in np.r_[0:100:6]}
{0: array([[ 1, 2, 10, 8, 7],
[10, 1, 8, 2, 5]]),
6: array([[ 7, 10, 7, 6, 4],
[ 9, 4, 6, 1, 1]]),
12: array([[ 7, 10, 10, 5, 4],
[ 7, 6, 2, 6, 7]]),
18: array([[9, 9, 8, 5, 8],
[8, 9, 6, 1, 2]]),
24: array([[8, 1, 5, 9, 7],
[1, 1, 6, 8, 3]]),
30: array([[ 7, 2, 8, 3, 4],
[ 5, 10, 5, 6, 5]]),
36: array([[ 6, 5, 9, 4, 5],
[ 6, 10, 8, 10, 10]]),
42: array([[ 2, 2, 6, 6, 9],
[ 5, 7, 4, 8, 10]]),
48: array([[10, 1, 1, 1, 8],
[ 5, 6, 4, 5, 8]]),
54: array([[ 6, 2, 6, 6, 6],
[10, 6, 9, 4, 8]]),
60: array([[ 6, 8, 2, 1, 7],
[ 1, 10, 10, 2, 3]]),
66: array([[4, 8, 7, 1, 3],
[1, 3, 8, 4, 2]]),
72: array([[ 8, 9, 8, 6, 7],
[ 1, 4, 2, 10, 1]]),
78: array([[ 2, 1, 3, 8, 9],
[ 2, 8, 6, 10, 3]]),
84: array([[ 1, 8, 9, 4, 3],
[ 9, 10, 8, 3, 4]]),
90: array([[ 4, 4, 2, 8, 6],
[ 1, 3, 10, 2, 6]]),
96: array([[ 9, 9, 8, 4, 4],
[ 2, 8, 10, 3, 5]])}
average(values,24)
{24: array([[ 6. , 7.75, 8.75, 6. , 5.75],
[ 8.5 , 5. , 5.5 , 2.5 , 3.75]]),
48: array([[ 5.75, 2.5 , 7. , 5.5 , 6.25],
[ 4.25, 7. , 5.75, 8. , 7. ]]),
72: array([[ 6.5 , 4.75, 4. , 2.25, 6. ],
[ 4.25, 6.25, 7.75, 3.75, 5.25]]),
96: array([[ 3.75, 5.5 , 5.5 , 6.5 , 6.25],
[ 3.25, 6.25, 6.5 , 6.25, 3.5 ]])}
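When the samples form a complete regular grid (constant step, no missing hours), the digitize step can be avoided by stacking the arrays into one 3D array and averaging fixed-size blocks along the time axis. A sketch under that assumption; average_stacked and the step parameter are my naming, and the trailing-edge sample (hour 96 above) is dropped to match the output shown:

```python
import numpy as np

def average_stacked(values, aggr_step=24, step=6):
    # Stack the 2D arrays in time order: shape (n_times, rows, cols)
    hours = sorted(values)
    data = np.stack([values[h] for h in hours])
    per_bin = aggr_step // step            # samples per aggregation window
    n_bins = data.shape[0] // per_bin      # a trailing partial window is dropped
    trimmed = data[:n_bins * per_bin]
    # Reshape to (n_bins, per_bin, rows, cols) and average within each window
    binned = trimmed.reshape(n_bins, per_bin, *data.shape[1:]).mean(axis=1)
    labels = [hours[0] + (i + 1) * aggr_step for i in range(n_bins)]
    return dict(zip(labels, binned))
```

This trades the per-bin Python loop for one vectorized mean, which matters once the grids get large.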
Upvotes: 2