Reputation: 111
I'm in a bit of a pickle. I've been working on a problem all day without seeing any real results. I'm working in Python and using Pandas for handling data.
What I'm trying to achieve is to sum each type of a customer's previous interactions, per survey. An interaction should only be counted if its timestamp is earlier than the timestamp of the survey. Ideally, I would also like to restrict the sum to interactions within some period before the survey, e.g. the last 5 years.
The first dataframe contains a customer ID, the segmentation of that customer in that survey (e.g. 1 being "happy", 2 being "sad") and a timestamp for the time of the recorded segment, i.e. the time of that survey.
import pandas as pd
#Generic example
customers = pd.DataFrame({"customerID":[1,1,1,2,2,3,4,4],"customerSeg":[1,2,2,1,2,3,3,3],"timestamp":['1999-01-01','2000-01-01','2000-06-01','2001-01-01','2003-01-01','1999-01-01','2005-01-01','2008-01-01']})
customers
Which yields something like:
customerID | customerSeg | timestamp |
---|---|---|
1 | 1 | 1999-01-01 |
1 | 1 | 2000-01-01 |
1 | 1 | 2000-06-01 |
2 | 2 | 2001-01-01 |
2 | 2 | 2003-01-01 |
3 | 3 | 1999-01-01 |
4 | 4 | 2005-01-01 |
4 | 4 | 2008-01-01 |
The other dataframe contains interactions with that customer, e.g. a service visit or a phone call.
interactions = pd.DataFrame({"customerID":[1,1,1,1,2,2,2,2,4,4,4],"timestamp":['1999-07-01','1999-11-01','2000-03-01','2001-04-01','2000-12-01','2002-01-01','2004-03-01','2004-05-01','2000-01-01','2004-01-01','2009-01-01'],"service":[1,0,1,0,1,0,1,1,0,1,1],"phonecall":[0,1,1,1,1,1,0,1,1,0,1]})
interactions
Output:
customerID | timestamp | service | phonecall |
---|---|---|---|
1 | 1999-07-01 | 1 | 0 |
1 | 1999-11-01 | 0 | 1 |
1 | 2000-03-01 | 1 | 1 |
1 | 2001-04-01 | 0 | 1 |
2 | 2000-12-01 | 1 | 1 |
2 | 2002-01-01 | 0 | 1 |
2 | 2004-03-01 | 1 | 0 |
2 | 2004-05-01 | 1 | 1 |
4 | 2000-01-01 | 0 | 1 |
4 | 2004-01-01 | 1 | 0 |
4 | 2009-01-01 | 1 | 1 |
Desired result, summing all previous interactions (ideally, I would like only the last 5 years):
customerID | customerSeg | timestamp | service | phonecall |
---|---|---|---|---|
1 | 1 | 1999-01-01 | 0 | 0 |
1 | 1 | 2000-01-01 | 1 | 1 |
1 | 1 | 2000-06-01 | 2 | 2 |
2 | 2 | 2001-01-01 | 1 | 1 |
2 | 2 | 2003-01-01 | 1 | 2 |
3 | 3 | 1999-01-01 | 0 | 0 |
4 | 4 | 2005-01-01 | 1 | 1 |
4 | 4 | 2008-01-01 | 1 | 1 |
I've tried almost everything I could come up with, so I would really appreciate some input. I'm pretty much confined to Python and Pandas, since that's what I'm most familiar with, but also because I need to read a csv file of the customer segmentation.
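For context, the segmentation data comes from a csv file that I read along these lines (the file name is just a placeholder; the columns are as in the generic example above):
# Placeholder file name; the real file has the customerID, customerSeg and timestamp columns
customers = pd.read_csv("customer_segmentation.csv", parse_dates=["timestamp"])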
Upvotes: 0
Views: 629
Reputation: 28729
One option is to use the conditional_join from pyjanitor to compute the rows that match the criteria, before grouping and summing:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
(interactions
 .conditional_join(
     customers,
     # column from left, column from right, comparison operator
     ('timestamp', 'timestamp', '<='),
     ('timestamp', 'start_date', '>='),
     ('customerID', 'customerID', '=='),
     how='right')
 # drop irrelevant columns
 .drop(columns=[('left', 'customerID'),
                ('left', 'timestamp'),
                ('right', 'start_date')])
 # return to single index
 .droplevel(0, 1)
 .groupby(['customerID', 'customerSeg', 'timestamp'])
 .sum()
)
service phonecall
customerID customerSeg timestamp
1 1 1999-01-01 0.0 0.0
2 2000-01-01 1.0 1.0
2000-06-01 2.0 2.0
2 1 2001-01-01 1.0 1.0
2 2003-01-01 1.0 2.0
3 3 1999-01-01 0.0 0.0
4 3 2005-01-01 1.0 1.0
2008-01-01 1.0 0.0
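If you would rather have a flat dataframe than the MultiIndex shown above, chaining a reset_index() onto the pipeline should be enough (a small addition on my part, assuming the pipeline above has been assigned to a variable, here called summed):
# summed = (interactions.conditional_join(...) ... .sum())  # the pipeline above
flat = summed.reset_index()  # grouping keys become ordinary columns again
flat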
Upvotes: 0
Reputation: 2813
I think you need several steps for transforming your data.
First of all, we convert the `timestamp` columns in both dataframes to `datetime`, so we can calculate the desired interval and do the comparisons:
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
After that, we create a new column that contains the start date of the interval (here, 5 years before the survey timestamp):
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
Now we join the `customers` dataframe with the `interactions` dataframe on `customerID`:
result = customers.merge(interactions, on='customerID', how='outer')
This yields
customerID customerSeg timestamp_x start_date timestamp_y service phonecall
0 1 1 1999-01-01 1994-01-01 1999-07-01 1.0 0.0
1 1 1 1999-01-01 1994-01-01 1999-11-01 0.0 1.0
2 1 1 1999-01-01 1994-01-01 2000-03-01 1.0 1.0
3 1 1 1999-01-01 1994-01-01 2001-04-01 0.0 1.0
4 1 2 2000-01-01 1995-01-01 1999-07-01 1.0 0.0
5 1 2 2000-01-01 1995-01-01 1999-11-01 0.0 1.0
6 1 2 2000-01-01 1995-01-01 2000-03-01 1.0 1.0
7 1 2 2000-01-01 1995-01-01 2001-04-01 0.0 1.0
...
Now here is how the condition is evaluated: we only want to count the `service` and `phonecall` interactions from rows that meet the condition (`timestamp_y` lies in the interval between `start_date` and `timestamp_x`), so we replace the values in all other rows with zero:
result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
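As a side note, the two `apply` calls work row by row; on a larger dataframe a vectorized mask should give the same result and be faster. A sketch of that alternative:
# Vectorized alternative to the two apply() calls above: keep service/phonecall
# only where timestamp_y falls inside [start_date, timestamp_x], zero out the rest
in_window = (result['timestamp_y'] >= result['start_date']) & (result['timestamp_y'] <= result['timestamp_x'])
result.loc[~in_window, ['service', 'phonecall']] = 0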
Finally, we group the dataframe, summing up the `service` and `phonecall` interactions:
result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()
Result:
service phonecall
customerID timestamp_x customerSeg
1 1999-01-01 1 0.0 0.0
2000-01-01 2 1.0 1.0
2000-06-01 2 2.0 2.0
2 2001-01-01 1 1.0 1.0
2003-01-01 2 1.0 2.0
3 1999-01-01 3 0.0 0.0
4 2005-01-01 3 1.0 1.0
2008-01-01 3 1.0 0.0
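If you prefer whole-number counts (the floats are a side effect of the outer merge introducing NaNs for customers without any interactions), you can cast the summed columns back to integers:
# Optional: convert the float sums back to integer counts
result = result.fillna(0).astype(int)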
(Note that your `customerSeg` data in the sample code doesn't quite seem to match the data in the table.)
Upvotes: 1