Python Pandas data aggregation by time interval

Question

I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)

This is an example of my data:

            desc        date_1      date_2      date_3      values
54287171    cc-cc       2018-03-14  2017-07-03  2018-05-21  55
49410141    other-dd    2012-01-18  2017-01-26  2011-12-30  17
37694577    other-dd    2018-07-05  2017-07-25  2018-06-19  9
54051782    other-cc    2014-10-23  2017-11-24  2014-10-31  37
7378464     dd-cc       2016-08-05  2018-05-15  2016-07-22  92
29665541    dd-cc       2011-12-14  2017-08-01  2012-05-01  40
2999878     dd-cc       2018-10-03  2018-04-13  2018-09-17  37
39453869    cc-cc       2015-11-24  2017-09-09  2015-11-21  81
7181109     dd-dd       2018-01-18  2017-11-24  2018-01-15  27
29580865    dd-cc       2017-04-24  2017-09-07  2017-05-04  38
14778957    other-cc    2017-11-02  2017-06-20  2018-06-26  49
32500886    cc-dd       2017-01-12  2017-05-26  2017-01-12  50
52146154    other-cc    2018-08-01  2017-03-27  2018-07-16  5
7208584     cc-dd       2018-03-13  2018-07-04  2018-04-26  8
35894666    cc-cc       2017-12-04  2018-06-13  2018-08-14  88
27565108    other-other 2015-10-19  2017-03-14  2016-01-22  88
50705834    other-cc    2018-01-08  2017-12-09  2018-01-11  62
45420360    dd-cc       2017-10-23  2017-09-02  2018-01-29  52
55933497    dd-cc       2017-04-14  2018-06-07  2017-09-27  36
46160680    dd-cc       2014-06-05  2018-01-16  2016-01-27  87

In brief, I'm trying to recreate is this functionality:

SUM(CASE 
WHEN date_1 <= date_2 - interval '11' month
AND  date_3 > date_2 - interval '11' month
THEN values
end)

But then also group by the desc column.

So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:

"date_2" is the date of an event happening and I'm trying to sum up the values at that point in time. Is this just achieved with groupby?
I'm then trying to add in the interval in monthly increments, i.e. the same function, summing the values but 12 months prior to date_2, then 11 months prior to date_2, then 10 months prior to date_2.

An example output should essentially look like this:

    desc        interval         values_sum
    cc-cc       at_date          55
    cc-dd       at_date          17
    other-dd    at_date          9
    cc-cc       date_minus_1     37
    cc-dd       date_minus_1     92
    other-dd    date_minus_1     40               
    cc-cc       date_minus_2     37

roastbeeef · Accepted Answer

Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.

from dateutil.relativedelta import relativedelta
for i in np.arange(-12,12,1):
      df['Month_' + str(i)] = df.apply(lambda x: x['values'] 
                                       if (x['date_2'] <= x['date_1'] + relativedelta(months=i)) \
                                                 & (x['date_3'] > x['date_2'] + relativedelta(months=i)) 
                                       else 0, axis=1)

Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()

Python Pandas data aggregation by time interval

Answers (2)

Related Questions