ctd25
ctd25

Reputation: 760

Create new column based on date condition in another dataframe

I have two dataframes - one that contains transaction/order info and another that contains when emails from a campaign are sent out.

import pandas as pd
email_data = {'Email' : ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
                     'email_sent_date' : ['2019-08-05', '2019-08-01', '2019-09-09', '2019-09-05'],
                     'CampaignName' : ['Campaign1', 'Campaign2', 'Campaign2', 'Campaign1']}

email_df = pd.DataFrame(email_data)

transaction_data = {'Email' : ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
         'order_date' : ['2019-09-05', '2019-09-10', '2019-09-05', '2019-09-10']}

transaction_df = pd.DataFrame(transaction_data)

I am trying to answer the question - for a given transaction, what was the most recent email sent prior to that transaction?

I have a solution using pd.apply, but isn't particularly fast and probably isn't sustainable for larger transaction volumes. I'm wondering if there is a better way to do this.

email_df.sort_values(by='email_sent_date', inplace=True)

def find_recent_email(x,df):
    #df should be the email_df, sorted by customer_email and date, in ascending order
    #so the most recent email is last
    return df[df['Email'] == x]['email_sent_date'].tail(1).values[0]

transaction_df['recent_email_date'] = pd.to_datetime(transaction_df['Email'].apply(find_recent_email, args=(email_df,)))

Upvotes: 2

Views: 293

Answers (1)

anky
anky

Reputation: 75080

First convert the columns to datetime (if already datetime-ignore this step):

email_df.email_sent_date=pd.to_datetime(email_df.email_sent_date)
transaction_df.order_date=pd.to_datetime(transaction_df.order_date)

Sorting the dates in both dataframes:

a=email_df.sort_values('email_sent_date').drop('CampaignName',1)
b=transaction_df.sort_values('order_date')

using merge_asof we can first merge on Email using the by= param, then perform the asof merge using direction='nearest':

pd.merge_asof(a,b,by='Email',left_on='email_sent_date',
                   right_on='order_date',direction='nearest')

             Email email_sent_date order_date
0   [email protected]      2019-08-01 2019-09-05
1   [email protected]      2019-08-05 2019-09-05
2  [email protected]      2019-09-05 2019-09-05
3  [email protected]      2019-09-09 2019-09-10

Params used:

by : column name or list of column names Match on these columns before performing merge operation.

left_by : column name Field names to match on in the left DataFrame.

right_by : column name Field names to match on in the right DataFrame.

direction : ‘backward’ (default), ‘forward’, or ‘nearest’ Whether to search for prior, subsequent, or closest matches.

You can also take a look at tolerance param which reads as:

tolerance : integer or Timedelta, optional, default None Select asof tolerance within this range; must be compatible with the merge index.

Upvotes: 2

Related Questions