Reputation: 760
I have two dataframes - one that contains transaction/order info and another that contains when emails from a campaign are sent out.
import pandas as pd
email_data = {'Email' : ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'email_sent_date' : ['2019-08-05', '2019-08-01', '2019-09-09', '2019-09-05'],
'CampaignName' : ['Campaign1', 'Campaign2', 'Campaign2', 'Campaign1']}
email_df = pd.DataFrame(email_data)
transaction_data = {'Email' : ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'order_date' : ['2019-09-05', '2019-09-10', '2019-09-05', '2019-09-10']}
transaction_df = pd.DataFrame(transaction_data)
I am trying to answer the question - for a given transaction, what was the most recent email sent prior to that transaction?
I have a solution using pd.apply, but isn't particularly fast and probably isn't sustainable for larger transaction volumes. I'm wondering if there is a better way to do this.
email_df.sort_values(by='email_sent_date', inplace=True)
def find_recent_email(x,df):
#df should be the email_df, sorted by customer_email and date, in ascending order
#so the most recent email is last
return df[df['Email'] == x]['email_sent_date'].tail(1).values[0]
transaction_df['recent_email_date'] = pd.to_datetime(transaction_df['Email'].apply(find_recent_email, args=(email_df,)))
Upvotes: 2
Views: 293
Reputation: 75080
First convert the columns to datetime (if already datetime-ignore this step):
email_df.email_sent_date=pd.to_datetime(email_df.email_sent_date)
transaction_df.order_date=pd.to_datetime(transaction_df.order_date)
Sorting the dates in both dataframes:
a=email_df.sort_values('email_sent_date').drop('CampaignName',1)
b=transaction_df.sort_values('order_date')
using merge_asof
we can first merge on Email
using the by=
param, then perform the asof merge using direction='nearest'
:
pd.merge_asof(a,b,by='Email',left_on='email_sent_date',
right_on='order_date',direction='nearest')
Email email_sent_date order_date
0 [email protected] 2019-08-01 2019-09-05
1 [email protected] 2019-08-05 2019-09-05
2 [email protected] 2019-09-05 2019-09-05
3 [email protected] 2019-09-09 2019-09-10
Params used:
by : column name or list of column names Match on these columns before performing merge operation.
left_by : column name Field names to match on in the left DataFrame.
right_by : column name Field names to match on in the right DataFrame.
direction : ‘backward’ (default), ‘forward’, or ‘nearest’ Whether to search for prior, subsequent, or closest matches.
You can also take a look at tolerance
param which reads as:
tolerance : integer or Timedelta, optional, default None Select asof tolerance within this range; must be compatible with the merge index.
Upvotes: 2