Reputation: 828
I looked at pandas and it might not be easy to use for my purpose. Perhaps there is a Python toolkit out there that I'm not aware of. Could you tell me what package could easily handle situations like the one described below?
I have a series of 2D numpy arrays of instantaneous values of a meteorological variable (for a geographical area), sampled at hours 0, 6, 12, ..., 96, ... for a specific date.
I have it as a python dictionary:
values[0]: [[3, 2, ..., 9, 5], [6, 7, ..., 6, 7]]
...
...
values[96]: [[2, 2, ..., 8, 5], [6, 7, ..., 5, 6]]
I need to average over a configurable aggregation step. For example, with an aggregation step of 24 hours I would obtain 4 daily averages:
values_avg24h=average(values, aggr_step=24)
values_avg24h[24]=[[...][...]]
values_avg24h[48]=[[...][...]]
values_avg24h[72]=[[...][...]]
values_avg24h[96]=[[...][...]]
The time resolution can change, and the original values may themselves already be averages over a different aggregation step. An existing package that handles all of this would be ideal, but even a clever solution to this simple problem would be appreciated.
Upvotes: 0
Views: 5407
Reputation: 9676
If the data are sampled at a regular frequency (as your post seems to indicate), then pandas can certainly help you.
One possible pandas solution is to create a Panel object where the items (like the keys in your dictionary) are of class pandas.tseries.index.DatetimeIndex. The major_axis and minor_axis elements of the Panel will then be the rows and columns of your numpy arrays. This sounds a bit strange, so I will show a quick example.
>>> import pandas as pd
>>> import numpy as np
>>> start_date = '4/14/2013' # Start today (default time is midnight)
# create the date_range we will use as panel items
>>> ind = pd.date_range(start=start_date, periods=20, freq='6H')
# Create data: thanks @mtadd
>>> values = {h: np.array([np.random.randint(1, 11, 5) for _ in range(2)])
...           for h in np.r_[0:120:6]}
# Create the Panel object directly from the dict of 2D np.arrays
>>> my_panel = pd.Panel(values)
# Set the Panel's items to be the date_range we made earlier
>>> my_panel.items = ind
>>> my_panel.ix[0] # Show first 2D array of data
0 1 2 3 4
0 4 3 10 6 6
1 4 10 7 10 6
>>> values[0] # Same as above, but from the dict
array([[ 4, 3, 10, 6, 6],
[ 4, 10, 7, 10, 6]])
Now that we have things set up, we will use the Panel method resample. We will supply two arguments: the first is the new frequency we want the data at, and the second is the keyword argument how, which we will set to 'mean'. This is the main part of my answer!
>>> averaged = my_panel.resample('24H', how='mean')
>>> list(averaged.items) # Show new time stamps for data
[<Timestamp: 2013-04-14 00:00:00>,
<Timestamp: 2013-04-15 00:00:00>,
<Timestamp: 2013-04-16 00:00:00>,
<Timestamp: 2013-04-17 00:00:00>,
<Timestamp: 2013-04-18 00:00:00>]
Note that the '24H' parameter we gave the resample method was simply an extension of the example posted by the OP. See this link for more information on the kinds of frequency strings this argument can take.
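As a small aside (a sketch, not tied to the original data): anywhere a frequency string like '24H' is accepted, you can also pass a timedelta-like object, which sidesteps the alias spelling entirely.

```python
import pandas as pd
from datetime import timedelta

# A frequency can be given as a string or as a timedelta-like object;
# these two describe the same 24-hour spacing.
ind_a = pd.date_range('2013-04-14', periods=4, freq=timedelta(hours=24))
ind_b = pd.date_range('2013-04-14', periods=4, freq=pd.Timedelta(hours=24))
print(list(ind_a) == list(ind_b))  # True
```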
To verify that the mean was calculated correctly, we will compare the first item in the averaged Panel to the mean we calculate by hand.
>>> averaged.ix[0]
0 1 2 3 4
0 6.5 5.25 7.25 6.25 6.50
1 5.0 6.75 5.25 6.50 5.25
>>> (values[0] + values[6] + values[12] + values[18]) / 4.
array([[ 6.5 , 5.25, 7.25, 6.25, 6.5 ],
[ 5. , 6.75, 5.25, 6.5 , 5.25]])
You can have much more flexibility over the time stamps for the items if you pass in a datetime.datetime object. For example, if you started sampling on New Year's Day at 1:37 PM and continued sampling for 50 periods, you could do this:
>>> from datetime import datetime
>>> start = datetime(2013, 1, 1, 13, 37) # 1:37 on 1/1/2013
>>> ind = pd.date_range(start, periods=50, freq='6H')
Then you would carry on as before. You can also pass start and end times to date_range instead of the number of periods you want it to generate. If we collected data every 2.5 hours, from the same starting time as above until 4:50 AM on February 28th, you could do the following:
>>> end = datetime(2013, 2, 28, 4, 50) # 4:50 AM on 2/28/2013
>>> ind = pd.date_range(start, end, freq='2H 30MIN')
Notice that you don't have to pass start and end times that align perfectly with the frequency you passed in. For more details on how to work with the Panel object or time series data in pandas, check out the links given in those words.
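For readers on newer pandas: the Panel class has since been removed, but the same resample-and-mean idea works on a DataFrame whose index is a DatetimeIndex, flattening each 2D array into one row. A sketch, assuming the 2x5 array shapes used above (the variable names are mine):

```python
import numpy as np
import pandas as pd

# Sample data shaped like the question: 2x5 arrays every 6 hours, 0..96
rng = np.random.default_rng(0)
values = {h: rng.integers(1, 11, size=(2, 5)) for h in range(0, 102, 6)}

hours = sorted(values)
ind = pd.to_datetime('2013-04-14') + pd.to_timedelta(hours, unit='h')

# One flattened array per row, time stamps as the index
df = pd.DataFrame([values[h].ravel() for h in hours], index=ind)

# Average each 24-hour window, then restore the 2x5 shape
avg = df.resample(pd.Timedelta(hours=24)).mean()
values_avg24h = {ts: row.to_numpy().reshape(2, 5) for ts, row in avg.iterrows()}
```

Note that the averages are labeled by the start of each window (a Timestamp) rather than by the hour at its end, matching resample's default labeling.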
Upvotes: 2
Reputation: 2555
import numpy as np

def average(values, aggr_step=24):
    # Sample hours (the dict keys) as a numpy array
    keys = np.array(sorted(values))
    # Right edges of the aggregation bins: aggr_step, 2*aggr_step, ...
    bins = np.arange(aggr_step, keys.max() + 1, aggr_step)
    # Bin index for each sample hour (a sample exactly at the last edge,
    # e.g. hour 96 here, falls past every bin and is dropped)
    npd = np.digitize(keys, bins)
    # Average the arrays within each bin, labeled by the bin's right edge
    return {b: np.mean([values[k] for k in keys[npd == i]], axis=0)
            for i, b in enumerate(bins)}
Whip up some sample values, and test:
values = {h: np.array([np.random.randint(1, 11, 5) for _ in range(2)])
          for h in np.r_[0:100:6]}
{0: array([[ 1, 2, 10, 8, 7],
[10, 1, 8, 2, 5]]),
6: array([[ 7, 10, 7, 6, 4],
[ 9, 4, 6, 1, 1]]),
12: array([[ 7, 10, 10, 5, 4],
[ 7, 6, 2, 6, 7]]),
18: array([[9, 9, 8, 5, 8],
[8, 9, 6, 1, 2]]),
24: array([[8, 1, 5, 9, 7],
[1, 1, 6, 8, 3]]),
30: array([[ 7, 2, 8, 3, 4],
[ 5, 10, 5, 6, 5]]),
36: array([[ 6, 5, 9, 4, 5],
[ 6, 10, 8, 10, 10]]),
42: array([[ 2, 2, 6, 6, 9],
[ 5, 7, 4, 8, 10]]),
48: array([[10, 1, 1, 1, 8],
[ 5, 6, 4, 5, 8]]),
54: array([[ 6, 2, 6, 6, 6],
[10, 6, 9, 4, 8]]),
60: array([[ 6, 8, 2, 1, 7],
[ 1, 10, 10, 2, 3]]),
66: array([[4, 8, 7, 1, 3],
[1, 3, 8, 4, 2]]),
72: array([[ 8, 9, 8, 6, 7],
[ 1, 4, 2, 10, 1]]),
78: array([[ 2, 1, 3, 8, 9],
[ 2, 8, 6, 10, 3]]),
84: array([[ 1, 8, 9, 4, 3],
[ 9, 10, 8, 3, 4]]),
90: array([[ 4, 4, 2, 8, 6],
[ 1, 3, 10, 2, 6]]),
96: array([[ 9, 9, 8, 4, 4],
[ 2, 8, 10, 3, 5]])}
average(values,24)
{24: array([[ 6. , 7.75, 8.75, 6. , 5.75],
[ 8.5 , 5. , 5.5 , 2.5 , 3.75]]),
48: array([[ 5.75, 2.5 , 7. , 5.5 , 6.25],
[ 4.25, 7. , 5.75, 8. , 7. ]]),
72: array([[ 6.5 , 4.75, 4. , 2.25, 6. ],
[ 4.25, 6.25, 7.75, 3.75, 5.25]]),
96: array([[ 3.75, 5.5 , 5.5 , 6.5 , 6.25],
[ 3.25, 6.25, 6.5 , 6.25, 3.5 ]])}
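When the samples form a complete regular grid (constant step, no missing hours), the digitize step can be avoided by stacking the arrays into one 3D array and averaging fixed-size blocks along the time axis. A sketch under that assumption; average_stacked and the step parameter are my naming, and the trailing-edge sample (hour 96 above) is dropped to match the output shown:

```python
import numpy as np

def average_stacked(values, aggr_step=24, step=6):
    # Stack the 2D arrays in time order: shape (n_times, rows, cols)
    hours = sorted(values)
    data = np.stack([values[h] for h in hours])
    per_bin = aggr_step // step            # samples per aggregation window
    n_bins = data.shape[0] // per_bin      # a trailing partial window is dropped
    trimmed = data[:n_bins * per_bin]
    # Reshape to (n_bins, per_bin, rows, cols) and average within each window
    binned = trimmed.reshape(n_bins, per_bin, *data.shape[1:]).mean(axis=1)
    labels = [hours[0] + (i + 1) * aggr_step for i in range(n_bins)]
    return dict(zip(labels, binned))
```

This trades the per-bin Python loop for one vectorized mean, which matters once the grids get large.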
Upvotes: 2