Reputation: 105
I have a two dataframes:
email_date_df:
email_id |customer_id| email_date | email_opened
001 | 1000 | 03-02-21 | 1
002 | 1001 | 03-22-21 | 0
003 | 1002 | 04-02-21 | 1
004 | 1003 | 05-02-21 | 1
transaction_df:
trans_id |customer_id| trans_date | amount
001 | 1000 | 03-04-21 | $10
002 | 1001 | 04-30-21 | $24
003 | 1001 | 05-02-21 | $14
004 | 1003 | 04-10-21 | $149
I want to understand for each email sent to a customer, was there a transaction that happened within 30 days. I merged on date frames based on customer_id
but have too much duplicate rows and data.
Is there a way I can search through transaction_df
for each row in email_date_df
to see if there was a transaction within 30 days?
The output would look like:
email_date_df:
email_id |customer_id| email_date | email_opened | transaction_witin_30_days
001 | 1000 | 03-02-21 | 1 | 1
002 | 1001 | 03-22-21 | 0 | 0
003 | 1002 | 04-02-21 | 1 | 0
004 | 1003 | 05-02-21 | 1 | 0
Upvotes: 1
Views: 43
Reputation: 8790
I think you were close with doing a merge based on customer_id
.
First, do a left merge of the two DataFrames. For each email a customer received, there will be a row for each transaction they logged:
merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
Second, find the differences between all the email dates and transaction dates. Also make a binary column indicating whether or not the difference is within 30 days:
merge['diff'] = (merge['trans_date'] - merge['email_date'])
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
Then, for each unique email (email_id
), check if any of the merged entries have a difference within 30 days. Do this with a group by:
grouped = merge.groupby(['email_id'])['within30'].any().astype(int)
Finally, map these grouped values back to the original data:
email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)
Here is the result:
email_id customer_id email_date email_opened transaction_within_30_days
0 1 1000 2021-03-02 1 1
1 2 1001 2021-03-22 0 0
2 3 1002 2021-04-02 1 0
3 4 1003 2021-05-02 1 0
Code in full:
merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
merge['diff'] = (merge['trans_date'] - merge['email_date'])
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
grouped = merge.groupby(['email_id'])['within30'].any().astype(int)
email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)
Inspection of some variables for clarification.
>>> merge
email_id customer_id email_date ... amount diff within30
0 1 1000 2021-03-02 ... 10.0 2 days True
1 2 1001 2021-03-22 ... 24.0 39 days False
2 2 1001 2021-03-22 ... 14.0 41 days False
3 3 1002 2021-04-02 ... NaN NaT False
4 4 1003 2021-05-02 ... 149.0 -22 days False
[5 rows x 9 columns]
>>> grouped
email_id
1 1
2 0
3 0
4 0
Name: within30, dtype: int64
Upvotes: 1