TheDataFox
TheDataFox

Reputation: 334

Row Count in Group By if Date is Before the Current Row being Evaluated

I would like to calculate by customer_id the num_opens_at_campaign_send. This is dependant on the number of campaigns a customer has opened prior to each campaign being sent.

I can't figure out the best approach to do this in pandas, so any help would be appreciated. I'm thinking a groupby customer_id and an apply function to compare each campaign_sent date with all other dates in that column, but I'm unsure of the precise method to get the row counts to calculate the number of campaigns a customer has opened at the time of each campaign send.

The dataframe is as follows:

customer_id campaign_id campaign_sent opened
a 1234 2021-01-01 True
b 1234 2021-01-01 True
c 1234 2021-01-01 False
a 2222 2021-02-01 True
b 2222 2021-02-01 False
c 2222 2021-02-01 True
a 3333 2021-03-01 True
b 3333 2021-03-01 False
c 3333 2021-03-01 True

The desired output is:

customer_id campaign_id campaign_sent num_opens_at_campaign_send
a 1234 2021-01-01 0
b 1234 2021-01-01 0
c 1234 2021-01-01 0
a 2222 2021-02-01 1
b 2222 2021-02-01 1
c 2222 2021-02-01 0
a 3333 2021-03-01 2
b 3333 2021-03-01 1
c 3333 2021-03-01 1

So for the first campaign num_opens_at_campaign_send is 0 for all because there were no campaigns prior to it.

As an example customer_id 'b' has 1 open when campaign_id 3333 is sent because they opened the first campaign (1234) but did not open the second campaign (2222) email.

Upvotes: 0

Views: 459

Answers (1)

Amit Vikram Singh
Amit Vikram Singh

Reputation: 2128

You can use .groupby.transform to compute the num_opens_at_campaign_send column. First group by the customer_id, now num_opens_at_campaign_send for a row is cumulative sum of opened till the row - value of opened in that row.

To make sure that cumulative sum is computed in the right order of date, first sort the dataframe by campaign_sent column.

Use:

df = df.sort_values(by = ['campaign_sent'])
df['num_opens_at_campaign_send'] = ( df.groupby('customer_id')['opened']
                                       .cumsum() - df.opened)
df.drop(columns = 'opened', inplace = True)

Output:

>>> df
  customer_id  campaign_id campaign_sent  num_opens_at_campaign_send
0           a         1234    2021-01-01                           0
1           b         1234    2021-01-01                           0
2           c         1234    2021-01-01                           0
3           a         2222    2021-02-01                           1
4           b         2222    2021-02-01                           1
5           c         2222    2021-02-01                           0
6           a         3333    2021-03-01                           2
7           b         3333    2021-03-01                           1
8           c         3333    2021-03-01                           1

Upvotes: 2

Related Questions