Reputation: 3582
I have a pandas DataFrame that looks like this:
email signup_date
[email protected] 7/21/16
[email protected] 6/6/16
[email protected] 5/5/16
[email protected] 4/4/16
I have a second pandas DataFrame with related events (when a signup actually got followed through on) that looks like this:
email call_date
[email protected] 7/25/16
[email protected] 6/20/16
[email protected] 5/4/16
There are a few things to keep in mind. A signup_date may not have any later call_date for the corresponding email; that is, some people sign up but never get a call back. Ultimately the goal is to determine whether there is a call event that came after a signup event for a given user but before the next signup event for the same user, where the number of signup events per user is not known in advance.
Is there a pandas best-practices way to do this? For now I'm using a for loop, and it's extremely slow (it hasn't finished on 100,000 rows even after 20 minutes):
response_date = []
for i in range(signups.shape[0]):
    unique_id = signups.unique_id.values[i]
    start_date = signups.signup_date.values[i]
    end_date = signups.signup_date.values[-1]
    if end_date == start_date:
        end_date = end_date + pd.Timedelta(days=365)  # roughly one year past the last signup
    # calls for this user that fall inside the window
    tmp_df = calls[calls.unique_id == unique_id]
    tmp_df = tmp_df[(tmp_df.timestamp > start_date) & (tmp_df.timestamp < end_date)]
    tmp_df = tmp_df.sort_values('timestamp')
    if tmp_df.shape[0] > 0:
        response_date.append(tmp_df.timestamp.values[0])
    else:
        response_date.append(None)
Thanks for any advice!
Upvotes: 0
Views: 294
Reputation: 862901
Another solution with sort_values and aggregate first:
df = df1.merge(df2)
df = df[df.signup_date <= df.call_date]
print (df.sort_values("signup_date", ascending=False)
         .groupby(['call_date', 'email'], as_index=False)
         .first())
    call_date        email signup_date
0  2016-05-04  [email protected]  2016-04-04
1  2016-06-20  [email protected]  2016-06-06
2  2016-07-25  [email protected]  2016-07-21
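The same "latest signup per (call_date, email)" selection can also be done with drop_duplicates, which keeps the first row it sees per key; a minimal sketch (equivalent up to row order), assuming df1 and df2 are the two frames above with the dates already parsed:
df = df1.merge(df2)                          # inner join on the shared 'email' column
df = df[df.signup_date <= df.call_date]      # drop calls that precede the signup
out = (df.sort_values('signup_date', ascending=False)
         .drop_duplicates(['call_date', 'email']))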
Upvotes: 1
Reputation: 294358
setup
from io import StringIO  # Python 3; the old StringIO module was Python 2 only
import pandas as pd
txt1 = """email signup_date
[email protected] 7/21/16
[email protected] 6/6/16
[email protected] 5/5/16
[email protected] 4/4/16"""
df1 = pd.read_csv(StringIO(txt1), parse_dates=[1], delim_whitespace=True)
txt2 = """email call_date
[email protected] 7/25/16
[email protected] 6/20/16
[email protected] 5/4/16"""
df2 = pd.read_csv(StringIO(txt2), parse_dates=[1], delim_whitespace=True)
merge
combine df1 and df2 on email
df = df1.merge(df2)
df
filter 1
get rid of rows where call_date is prior to signup_date
cond1 = df.signup_date.le(df.call_date)
cond1
0     True
1     True
2    False
3     True
4     True
dtype: bool
df = df[cond1]
df
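As a stylistic aside, the same row filter can be written with query instead of a boolean mask; a small sketch assuming the merged df from above:
df = df.query('signup_date <= call_date')   # same rows as df[cond1]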
filter 2
groupby ['call_date', 'email'] and get the most recent signup_date with idxmax
most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
most_recent
call_date email
2016-05-04 [email protected] 4
2016-06-20 [email protected] 1
2016-07-25 [email protected] 0
Name: signup_date, dtype: int64
result
idxmax returned index labels, so .loc pulls exactly those rows from df
df.loc[most_recent]
all together
df = df1.merge(df2)
cond1 = df.signup_date.le(df.call_date)
df = df[cond1]
most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
df.loc[most_recent]
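A further option for the "first call on or after each signup per user" framing in the question is pd.merge_asof. This is a sketch, not benchmarked here, assuming df1 and df2 are the datetime-parsed frames from the setup above; note it does not enforce the "before the next signup" bound on its own:
first_call = pd.merge_asof(
    df1.sort_values('signup_date'),   # left frame must be sorted on its key
    df2.sort_values('call_date'),     # right frame must be sorted on its key
    left_on='signup_date',
    right_on='call_date',
    by='email',                       # match within the same email only
    direction='forward',              # first call at or after the signup
)
Rows with no later call for that email come back with NaT in call_date, which matches the "never called back" case in the question.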
Upvotes: 1