Reputation: 334
I would like to calculate by customer_id
the num_opens_at_campaign_send
. This is dependant on the number of campaigns a customer has opened prior to each campaign being sent.
I can't figure out the best approach to do this in pandas, so any help would be appreciated. I'm thinking a groupby customer_id
and an apply
function to compare each campaign_sent
date with all other dates in that column, but I'm unsure of the precise method to get the row counts to calculate the number of campaigns a customer has opened at the time of each campaign send.
The dataframe is as follows:
customer_id | campaign_id | campaign_sent | opened |
---|---|---|---|
a | 1234 | 2021-01-01 | True |
b | 1234 | 2021-01-01 | True |
c | 1234 | 2021-01-01 | False |
a | 2222 | 2021-02-01 | True |
b | 2222 | 2021-02-01 | False |
c | 2222 | 2021-02-01 | True |
a | 3333 | 2021-03-01 | True |
b | 3333 | 2021-03-01 | False |
c | 3333 | 2021-03-01 | True |
The desired output is:
customer_id | campaign_id | campaign_sent | num_opens_at_campaign_send |
---|---|---|---|
a | 1234 | 2021-01-01 | 0 |
b | 1234 | 2021-01-01 | 0 |
c | 1234 | 2021-01-01 | 0 |
a | 2222 | 2021-02-01 | 1 |
b | 2222 | 2021-02-01 | 1 |
c | 2222 | 2021-02-01 | 0 |
a | 3333 | 2021-03-01 | 2 |
b | 3333 | 2021-03-01 | 1 |
c | 3333 | 2021-03-01 | 1 |
So for the first campaign num_opens_at_campaign_send
is 0 for all because there were no campaigns prior to it.
As an example customer_id
'b' has 1 open when campaign_id
3333 is sent because they opened the first campaign (1234) but did not open the second campaign (2222) email.
Upvotes: 0
Views: 459
Reputation: 2128
You can use .groupby.transform
to compute the num_opens_at_campaign_send
column. First group by the customer_id
, now num_opens_at_campaign_send
for a row is cumulative sum of opened till the row
- value of opened in that row
.
To make sure that cumulative sum
is computed in the right order of date, first sort the dataframe by campaign_sent
column.
Use:
df = df.sort_values(by = ['campaign_sent'])
df['num_opens_at_campaign_send'] = ( df.groupby('customer_id')['opened']
.cumsum() - df.opened)
df.drop(columns = 'opened', inplace = True)
Output:
>>> df
customer_id campaign_id campaign_sent num_opens_at_campaign_send
0 a 1234 2021-01-01 0
1 b 1234 2021-01-01 0
2 c 1234 2021-01-01 0
3 a 2222 2021-02-01 1
4 b 2222 2021-02-01 1
5 c 2222 2021-02-01 0
6 a 3333 2021-03-01 2
7 b 3333 2021-03-01 1
8 c 3333 2021-03-01 1
Upvotes: 2