Reputation: 2945
For instance, consider a list whose sub-elements have the form ['mm,dd,yyyy,hh,mm', 'value']:
A = [
    ['09,02,2011,09,34', 'apple'],
    ['09,05,2011,10,20', 'juice'],
    ['06,04,2012,09,38', 'juice'],
    ['06,04,2012,09,38', 'juice'],
    ['06,04,2012,09,40', 'apple'],
    ['06,04,2012,09,40', 'juice'],
    ['06,04,2012,09,42', 'green'],
    ['06,04,2012,23,08', 'juice'],
    ['06,04,2012,23,10', 'juice'],
    ['06,04,2012,23,10', 'ferrari'],
    ['07,03,2012,20,12', 'juice'],
    ['07,07,2012,21,03', 'people'],
    ['07,07,2012,21,04', 'juice'],
    ['07,07,2012,21,04', 'people'],
    ['07,07,2012,21,04', 'lime'],
    ['08,16,2012,08,55', 'juice'],
    ['08,16,2012,08,55', 'juice'],
    ['08,16,2012,08,55', 'lime'],
    ['08,16,2012,08,55', 'lime'],
    ['08,16,2012,08,56', 'juice'],
    ['08,16,2012,08,57', 'juice'],
    ['08,16,2012,08,57', 'andy'],
    ['01,16,2013,03,20', 'people'],
    ['02,10,2013,04,59', 'lime']
]
I am looking for a solid aggregation strategy that allows aggregating the values by year (min_year to max_year), by month (1-12), or by day of the month (1-30). More precisely, for each of the values in the list sub-elements:
So, if the aggregation type is year, then:
out = [
    {
        'name': 'value1_name',
        'series': [['min_year', 'count']...['max_year', 'count']]
    },
    {
        'name': 'value2_name',
        'series': [['min_year', 'count']...['max_year', 'count']]
    }, ...
]
If the aggregation type is month, then:
out = [
    {
        'name': 'value1_name',
        'series': [['01', 'count']...['12', 'count']]
    },
    {
        'name': 'value2_name',
        'series': [['01', 'count']...['12', 'count']]
    }, ...
]
And if the aggregation type is provided as days, then:
out = [
    {
        'name': 'value1_name',
        'series': [['01', 'count']...['30', 'count']]
    },
    {
        'name': 'value2_name',
        'series': [['01', 'count']...['30', 'count']]
    }, ...
]
The entire pain in this issue for me is filling in the missing month, day, or year values for the respective aggregation types. So, for instance, if the aggregation type is year and I am aggregating all the 'juice' values, then:
out = [
    {
        'name': 'juice',
        'series': [['2011', '1'], ['2012', '11'], ['2013', '0']]
    }, ...
]
...and the same issue applies to month and day values. The key point is that all the values ('apple', 'juice', etc.) should have series of equal length, which means that if 'ferrari' doesn't occur in 2011 and 2013, then its series should contain ['2011', '0'] and ['2013', '0']. Similarly, if 'ferrari' doesn't occur in any month except June ('06'), then its series should look like:
'series': [
    ['01', '0'],
    ['02', '0'],
    ['03', '0'],
    ['04', '0'],
    ['05', '0'],
    ['06', '1'],
    ['07', '0'],
    ['08', '0'],
    ['09', '0'],
    ['10', '0'],
    ['11', '0'],
    ['12', '0']
]
...and the same goes for days.
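To make the padding concrete, here is a rough plain-Python sketch of what I mean for the year case (the pad_series helper and its hard-coded 2011-2013 bounds are only for illustration, not what I actually want to end up with):
from collections import Counter

def pad_series(A, minval=2011, maxval=2013):
    # value -> Counter mapping year string to number of occurrences
    counts = {}
    for date, value in A:
        year = date.split(',')[2]          # 'mm,dd,yyyy,hh,mm' -> 'yyyy'
        counts.setdefault(value, Counter())[year] += 1
    out = []
    for value, c in counts.items():
        # Missing years fall back to 0, so every series has the same length.
        series = [[str(y), str(c[str(y)])] for y in range(minval, maxval + 1)]
        out.append({'name': value, 'series': series})
    return out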
What is the best strategy I can use here? Many thanks.
Upvotes: 0
Views: 402
Reputation: 9890
So if you want a solution using NumPy, here's a compact bit of code for it:
import numpy as np

# col is the column of your data to aggregate over.
# minval and maxval are the minimum and maximum (inclusive)
# values it can take. Getting these set automatically
# is a trivial task.
def agg(A, col, minval, maxval):
    # D holds the split date fields as integers, V holds the values.
    D = np.array([[int(x) for x in d.split(',')] for d, t in A])
    V = np.array([t for d, t in A])
    dvals = range(minval, maxval + 1)
    q = []
    for v in np.unique(V):
        # For each possible column value x, count the rows where that
        # column equals x and the value equals v.
        q.append({'name': v, 'series':
                  np.array([[x, sum(V[D[..., col] == x] == v)] for x in dvals])})
    return q
Essentially, this converts your data into numpy arrays, which allows you to use advanced indexing to easily aggregate the data. D contains the dates and V the values, so, for example, V[D[...,1]==2] gives you every value whose day (column 1) is 2.
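For example, with the list A above (in your 'mm,dd,yyyy,hh,mm' layout, column 0 is the month, column 1 the day, and column 2 the year), the three aggregation types would be called roughly like this; treat it as a usage sketch, with the minval/maxval bounds picked by hand:
by_year  = agg(A, 2, 2011, 2013)   # yearly counts for 2011..2013
by_month = agg(A, 0, 1, 12)        # monthly counts for months 1..12
by_day   = agg(A, 1, 1, 31)        # daily counts for days 1..31
Note that the keys and counts come back as integers inside a numpy array rather than the zero-padded strings in your examples; converting them afterwards is straightforward.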
Note that this version is not necessarily the fastest. In particular, the conversion to numpy arrays is likely somewhat slow for large data sets, and the method of iterating over potential values is very compact but not necessarily the fastest approach. For the former, you could either create empty arrays and populate them with your data, or get your data in a manner more amenable to something like fromstring or loadtxt. For the latter, I'm not quite sure what the optimal algorithm would be.
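If the counting loop does become a bottleneck, one possible replacement (an untested sketch, not necessarily optimal) is to let numpy.bincount do the per-value counting instead of the Python-level sum over dvals:
# For each value v, bincount counts how often every integer appears in the
# chosen column among the rows holding v; slicing keeps only minval..maxval.
for v in np.unique(V):
    counts = np.bincount(D[V == v, col], minlength=maxval + 1)[minval:maxval + 1]
    q.append({'name': v,
              'series': np.column_stack((np.arange(minval, maxval + 1), counts))})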
Upvotes: 1