dsexplorer

Reputation: 105

Combine dataframes based on ID and Date within a timeframe

I have two dataframes:

email_date_df:

 email_id |customer_id| email_date  | email_opened
    001   |  1000     | 03-02-21    |       1
    002   |  1001     | 03-22-21    |       0
    003   |  1002     | 04-02-21    |       1
    004   |  1003     | 05-02-21    |       1

transaction_df:

 trans_id |customer_id| trans_date  | amount
    001   |  1000     | 03-04-21    |   $10
    002   |  1001     | 04-30-21    |   $24
    003   |  1001     | 05-02-21    |   $14
    004   |  1003     | 04-10-21    |   $149
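For reference, the two frames can be built roughly like this (IDs as plain integers and amounts as plain numbers for simplicity; dates kept as strings here):

import pandas as pd

email_date_df = pd.DataFrame({
    'email_id': [1, 2, 3, 4],
    'customer_id': [1000, 1001, 1002, 1003],
    'email_date': ['03-02-21', '03-22-21', '04-02-21', '05-02-21'],
    'email_opened': [1, 0, 1, 1],
})

transaction_df = pd.DataFrame({
    'trans_id': [1, 2, 3, 4],
    'customer_id': [1000, 1001, 1001, 1003],
    'trans_date': ['03-04-21', '04-30-21', '05-02-21', '04-10-21'],
    'amount': [10, 24, 14, 149],
})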

I want to understand, for each email sent to a customer, whether there was a transaction within 30 days. I merged the two dataframes on customer_id, but that gave me too many duplicate rows and too much data.

Is there a way I can search through transaction_df for each row in email_date_df to see if there was a transaction within 30 days?

The output would look like:

email_date_df:

 email_id |customer_id| email_date  | email_opened  | transaction_within_30_days
    001   |  1000     | 03-02-21    |       1       |       1
    002   |  1001     | 03-22-21    |       0       |       0
    003   |  1002     | 04-02-21    |       1       |       0
    004   |  1003     | 05-02-21    |       1       |       0

Upvotes: 1

Views: 43

Answers (1)

Tom

Reputation: 8790

I think you were close with doing a merge based on customer_id.

First, do a left merge of the two DataFrames. For each email a customer received, there will be a row for each transaction they logged:

merge = email_date_df.merge(transaction_df, on='customer_id', how='left')

Second, find the differences between all the email dates and transaction dates. Also make a boolean column indicating whether the difference falls between 0 and 30 days, so transactions that happened before the email don't count:

merge['diff'] = (merge['trans_date'] - merge['email_date'])
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
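The string comparisons work because pandas coerces '30D' and '0D' to Timedelta objects; if you prefer to spell the threshold out, an equivalent form is:

merge['within30'] = (merge['diff'] > pd.Timedelta(0)) & (merge['diff'] < pd.Timedelta(days=30))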

Then, for each unique email (email_id), check whether any of the merged rows has a difference within 30 days. Do this with a groupby:

grouped = merge.groupby(['email_id'])['within30'].any().astype(int)

Finally, map these grouped values back to the original data:

email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)

Here is the result:

   email_id  customer_id email_date  email_opened  transaction_within_30_days
0         1         1000 2021-03-02             1                           1
1         2         1001 2021-03-22             0                           0
2         3         1002 2021-04-02             1                           0
3         4         1003 2021-05-02             1                           0


Code in full:

merge = email_date_df.merge(transaction_df, on='customer_id', how='left')
merge['diff'] = (merge['trans_date'] - merge['email_date'])
merge['within30'] = (merge['diff'] < '30D') & (merge['diff'] > '0D')
grouped = merge.groupby(['email_id'])['within30'].any().astype(int)
email_date_df['transaction_within_30_days'] = email_date_df['email_id'].map(grouped)
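Note that this assumes email_date and trans_date are already datetime columns (as they are in the output below). If they are stored as strings like 03-02-21, convert them first, for example:

email_date_df['email_date'] = pd.to_datetime(email_date_df['email_date'], format='%m-%d-%y')
transaction_df['trans_date'] = pd.to_datetime(transaction_df['trans_date'], format='%m-%d-%y')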

For clarification, here is an inspection of some of the intermediate variables:

>>> merge
   email_id  customer_id email_date  ...  amount     diff within30
0         1         1000 2021-03-02  ...    10.0   2 days     True
1         2         1001 2021-03-22  ...    24.0  39 days    False
2         2         1001 2021-03-22  ...    14.0  41 days    False
3         3         1002 2021-04-02  ...     NaN      NaT    False
4         4         1003 2021-05-02  ...   149.0 -22 days    False

[5 rows x 9 columns]

>>> grouped
email_id
1    1
2    0
3    0
4    0
Name: within30, dtype: int64

Upvotes: 1
