Andreas Steinberg

Reputation: 111

Conditional groupby on dates using pandas

I'm in a bit of a pickle. I've been working on a problem all day without seeing any real results. I'm working in Python and using Pandas for handling data.

What I'm trying to achieve is to sum each type of interaction a customer has had before a survey, based on that customer's previous interactions. The timestamp of an interaction should be less than the timestamp of the survey. Ideally, I would like to sum only the interactions for the customer during some period before the survey - e.g. less than 5 years.

The first dataframe contains a customer ID, the segmentation of that customer in that survey (e.g. 1 being "happy", 2 being "sad") and a timestamp for the time of the recorded segment, i.e. the time of that survey.

import pandas as pd

# Generic example
customers = pd.DataFrame({
    "customerID": [1, 1, 1, 2, 2, 3, 4, 4],
    "customerSeg": [1, 2, 2, 1, 2, 3, 3, 3],
    "timestamp": ['1999-01-01', '2000-01-01', '2000-06-01', '2001-01-01',
                  '2003-01-01', '1999-01-01', '2005-01-01', '2008-01-01']
})

customers

Which yields something like:

   customerID  customerSeg   timestamp
0           1            1  1999-01-01
1           1            1  2000-01-01
2           1            1  2000-06-01
3           2            2  2001-01-01
4           2            2  2003-01-01
5           3            3  1999-01-01
6           4            4  2005-01-01
7           4            4  2008-01-01

The other dataframe contains interactions with that customer, e.g. a service visit and a phone call.

interactions = pd.DataFrame({
    "customerID": [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4],
    "timestamp": ['1999-07-01', '1999-11-01', '2000-03-01', '2001-04-01',
                  '2000-12-01', '2002-01-01', '2004-03-01', '2004-05-01',
                  '2000-01-01', '2004-01-01', '2009-01-01'],
    "service": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "phonecall": [0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]
})
interactions

Output:

    customerID   timestamp  service  phonecall
0            1  1999-07-01        1          0
1            1  1999-11-01        0          1
2            1  2000-03-01        1          1
3            1  2001-04-01        0          1
4            2  2000-12-01        1          1
5            2  2002-01-01        0          1
6            2  2004-03-01        1          0
7            2  2004-05-01        1          1
8            4  2000-01-01        0          1
9            4  2004-01-01        1          0
10           4  2009-01-01        1          1

Desired result, summing all previous interactions (ideally, I would like only the last 5 years):

customerID  customerSeg   timestamp  service  phonecall
         1            1  1999-01-01        0          0
         1            1  2000-01-01        1          1
         1            1  2000-06-01        2          2
         2            2  2001-01-01        1          1
         2            2  2003-01-01        1          2
         3            3  1999-01-01        0          0
         4            4  2005-01-01        1          1
         4            4  2008-01-01        1          1

I've tried almost everything I could come up with, so I would really appreciate some input. I'm pretty much confined to Python and pandas, since it's the language I'm most familiar with, but also because I need to read a csv file of the customer segmentation.
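
For context, the segmentation data is read from a csv roughly along these lines (the file name here is only a placeholder for my actual file):

# placeholder file name; the real file holds the customerID, customerSeg and timestamp columns
customers = pd.read_csv("customer_segmentation.csv", parse_dates=["timestamp"])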

Upvotes: 0

Views: 629

Answers (2)

sammywemmy

Reputation: 28729

One option is to use the conditional_join from pyjanitor to compute the rows that match the criteria, before grouping and summing:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)

(interactions
 .conditional_join(
     customers,
     # column from left, column from right, comparison operator
     ('timestamp', 'timestamp', '<='),
     ('timestamp', 'start_date', '>='),
     ('customerID', 'customerID', '=='),
     how='right')
  # drop irrelevant columns
 .drop(columns=[('left', 'customerID'), 
                ('left', 'timestamp'), 
                ('right', 'start_date')])
  # return to single index
 .droplevel(0,1)
 .groupby(['customerID', 'customerSeg', 'timestamp'])
 .sum()
)

                                   service  phonecall
customerID customerSeg timestamp                     
1          1           1999-01-01      0.0        0.0
           2           2000-01-01      1.0        1.0
                       2000-06-01      2.0        2.0
2          1           2001-01-01      1.0        1.0
           2           2003-01-01      1.0        2.0
3          3           1999-01-01      0.0        0.0
4          3           2005-01-01      1.0        1.0
                       2008-01-01      1.0        0.0
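
If you want the result as a flat dataframe with integer counts like in the expected output, a small follow-up step works (just a sketch, assuming the chained expression above was assigned to a variable, say out):

# `out` is a hypothetical name for the chained result above;
# flatten the MultiIndex and restore integer counts
out = out.reset_index()
out[['service', 'phonecall']] = out[['service', 'phonecall']].astype(int)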


Upvotes: 0

Gerd

Reputation: 2813

I think you need several steps to transform your data.

First of all, we convert the timestamp columns in both dataframes to datetime, so we can calculate the desired interval and do the comparisons:

customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])

After that, we create a new column that contains the start date (e.g. 5 years before the timestamp):

customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)

Now we join the customers dataframe with the interactions dataframe on the customerID:

result = customers.merge(interactions, on='customerID', how='outer')

This yields

    customerID  customerSeg timestamp_x start_date timestamp_y  service  phonecall
0            1            1  1999-01-01 1994-01-01  1999-07-01      1.0        0.0
1            1            1  1999-01-01 1994-01-01  1999-11-01      0.0        1.0
2            1            1  1999-01-01 1994-01-01  2000-03-01      1.0        1.0
3            1            1  1999-01-01 1994-01-01  2001-04-01      0.0        1.0
4            1            2  2000-01-01 1995-01-01  1999-07-01      1.0        0.0
5            1            2  2000-01-01 1995-01-01  1999-11-01      0.0        1.0
6            1            2  2000-01-01 1995-01-01  2000-03-01      1.0        1.0
7            1            2  2000-01-01 1995-01-01  2001-04-01      0.0        1.0
...

Now here is how the condition is evaluated: we only want to use the service and phonecall interactions from rows that meet the condition (timestamp_y lies in the interval between start_date and timestamp_x), so we replace all others with zero:

result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
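
(Side note: the same replacement can also be done without apply; here is a vectorized sketch operating on the merged result frame from above:)

# vectorized alternative to the two apply calls above
mask = result['timestamp_y'].between(result['start_date'], result['timestamp_x'])
result.loc[~mask, ['service', 'phonecall']] = 0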

Finally we group the dataframe, summing up the service and phonecall interactions:

result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()

Result:

                                    service  phonecall
customerID timestamp_x customerSeg                    
1          1999-01-01  1                0.0        0.0
           2000-01-01  2                1.0        1.0
           2000-06-01  2                2.0        2.0
2          2001-01-01  1                1.0        1.0
           2003-01-01  2                1.0        2.0
3          1999-01-01  3                0.0        0.0
4          2005-01-01  3                1.0        1.0
           2008-01-01  3                1.0        0.0

(Note that your customerSeg data in the sample code seems not quite to match the data in the table.)

Upvotes: 1
