Andreas Steinberg

Reputation: 111

Conditional groupby on dates using pandas

I'm in a bit of a pickle. I've been working on a problem all day without seeing any real results. I'm working in Python and using Pandas for handling data.

What I'm trying to achieve is to sum each type of interaction a customer has had before a survey, based on that customer's previous interactions. The timestamp of an interaction should be less than the timestamp of the survey. Ideally, I would like to sum only the interactions for the customer during some period before the survey - e.g. less than 5 years.

The first dataframe contains a customer ID, the segmentation of that customer in that survey (e.g. 1 being "happy", 2 being "sad") and a timestamp for the time of the recorded segment, i.e. the time of that survey.

import pandas as pd

# Generic example
customers = pd.DataFrame({
    "customerID": [1, 1, 1, 2, 2, 3, 4, 4],
    "customerSeg": [1, 2, 2, 1, 2, 3, 3, 3],
    "timestamp": ['1999-01-01', '2000-01-01', '2000-06-01', '2001-01-01',
                  '2003-01-01', '1999-01-01', '2005-01-01', '2008-01-01']
})

customers

Which yields something like:

   customerID  customerSeg   timestamp
0           1            1  1999-01-01
1           1            1  2000-01-01
2           1            1  2000-06-01
3           2            2  2001-01-01
4           2            2  2003-01-01
5           3            3  1999-01-01
6           4            4  2005-01-01
7           4            4  2008-01-01

The other dataframe contains interactions with that customer, e.g. a service visit and a phone call.

interactions = pd.DataFrame({
    "customerID": [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4],
    "timestamp": ['1999-07-01', '1999-11-01', '2000-03-01', '2001-04-01',
                  '2000-12-01', '2002-01-01', '2004-03-01', '2004-05-01',
                  '2000-01-01', '2004-01-01', '2009-01-01'],
    "service": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "phonecall": [0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]
})
interactions

Output:

    customerID   timestamp  service  phonecall
0            1  1999-07-01        1          0
1            1  1999-11-01        0          1
2            1  2000-03-01        1          1
3            1  2001-04-01        0          1
4            2  2000-12-01        1          1
5            2  2002-01-01        0          1
6            2  2004-03-01        1          0
7            2  2004-05-01        1          1
8            4  2000-01-01        0          1
9            4  2004-01-01        1          0
10           4  2009-01-01        1          1

Desired result, summing all previous interactions (ideally, I would like only the last 5 years):

customerID  customerSeg   timestamp  service  phonecall
         1            1  1999-01-01        0          0
         1            1  2000-01-01        1          1
         1            1  2000-06-01        2          2
         2            2  2001-01-01        1          1
         2            2  2003-01-01        1          2
         3            3  1999-01-01        0          0
         4            4  2005-01-01        1          1
         4            4  2008-01-01        1          1

I've tried almost everything I could come up with, so I would really appreciate some input. I'm pretty much confined to Python and pandas, since it's the language I'm most familiar with, but also because I need to read a csv file of the customer segmentation.
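
For context, the segmentation data is read from a csv roughly along these lines (the file name here is only a placeholder for my actual file):

# placeholder file name; the real file holds the customerID, customerSeg and timestamp columns
customers = pd.read_csv("customer_segmentation.csv", parse_dates=["timestamp"])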

Upvotes: 0

Views: 629

Answers (2)

sammywemmy

Reputation: 28729

One option is to use the conditional_join from pyjanitor to compute the rows that match the criteria, before grouping and summing:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)

(interactions
 .conditional_join(
     customers,
     # column from left, column from right, comparison operator
     ('timestamp', 'timestamp', '<='),
     ('timestamp', 'start_date', '>='),
     ('customerID', 'customerID', '=='),
     how='right')
  # drop irrelevant columns
 .drop(columns=[('left', 'customerID'), 
                ('left', 'timestamp'), 
                ('right', 'start_date')])
  # return to single index
 .droplevel(0,1)
 .groupby(['customerID', 'customerSeg', 'timestamp'])
 .sum()
)

                                   service  phonecall
customerID customerSeg timestamp                     
1          1           1999-01-01      0.0        0.0
           2           2000-01-01      1.0        1.0
                       2000-06-01      2.0        2.0
2          1           2001-01-01      1.0        1.0
           2           2003-01-01      1.0        2.0
3          3           1999-01-01      0.0        0.0
4          3           2005-01-01      1.0        1.0
                       2008-01-01      1.0        0.0
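
If you want the result as a flat dataframe with integer counts like in the expected output, a small follow-up step works (just a sketch, assuming the chained expression above was assigned to a variable, say out):

# `out` is a hypothetical name for the chained result above;
# flatten the MultiIndex and restore integer counts
out = out.reset_index()
out[['service', 'phonecall']] = out[['service', 'phonecall']].astype(int)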


Upvotes: 0

Gerd

Reputation: 2813

I think you need several steps to transform your data.

First of all, we convert the timestamp columns in both dataframes to datetime, so we can calculate the desired interval and do the comparisons:

customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])

After that, we create a new column that contains the start date (e.g. 5 years before the timestamp):

customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)

Now we join the customers dataframe with the interactions dataframe on the customerID:

result = customers.merge(interactions, on='customerID', how='outer')

This yields

    customerID  customerSeg timestamp_x start_date timestamp_y  service  phonecall
0            1            1  1999-01-01 1994-01-01  1999-07-01      1.0        0.0
1            1            1  1999-01-01 1994-01-01  1999-11-01      0.0        1.0
2            1            1  1999-01-01 1994-01-01  2000-03-01      1.0        1.0
3            1            1  1999-01-01 1994-01-01  2001-04-01      0.0        1.0
4            1            2  2000-01-01 1995-01-01  1999-07-01      1.0        0.0
5            1            2  2000-01-01 1995-01-01  1999-11-01      0.0        1.0
6            1            2  2000-01-01 1995-01-01  2000-03-01      1.0        1.0
7            1            2  2000-01-01 1995-01-01  2001-04-01      0.0        1.0
...

Now here is how the condition is evaluated: we only want to use the service and phonecall interactions from rows that meet the condition (timestamp_y lies in the interval between start_date and timestamp_x), so we replace all others with zero:

result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
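
(Side note: the same replacement can also be done without apply; here is a vectorized sketch operating on the merged result frame from above:)

# vectorized alternative to the two apply calls above
mask = result['timestamp_y'].between(result['start_date'], result['timestamp_x'])
result.loc[~mask, ['service', 'phonecall']] = 0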

Finally we group the dataframe, summing up the service and phonecall interactions:

result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()

Result:

                                    service  phonecall
customerID timestamp_x customerSeg                    
1          1999-01-01  1                0.0        0.0
           2000-01-01  2                1.0        1.0
           2000-06-01  2                2.0        2.0
2          2001-01-01  1                1.0        1.0
           2003-01-01  2                1.0        2.0
3          1999-01-01  3                0.0        0.0
4          2005-01-01  3                1.0        1.0
           2008-01-01  3                1.0        0.0

(Note that your customerSeg data in the sample code seems not quite to match the data in the table.)

Upvotes: 1
