helloB
helloB

Reputation: 3582

Finding first event in one table that occurred after event in a second table, per row

I have a pandas DataFrame that looks like this:

email    signup_date    
[email protected]   7/21/16
[email protected]   6/6/16
[email protected]   5/5/16
[email protected]   4/4/16

I have a second pandas DataFrame with related events, when a signup actually got followed through on, that looks like this:

email    call_date    
[email protected]   7/25/16
[email protected]   6/20/16
[email protected]   5/4/16

There are a few things to keep in mind.

Ultimately the goal is to determine whether there is a call event that came after a sign up event for a given user but before the next sign up event for the same user, where the number of signup events per user is not known in advance.

Is there a pandas best practices way to do this? For now I'm using a for loop, and it's extremely slow (hasn't finished on 100,000 rows even after 20 minutes):

response_date = []
for i in range(signups.shape[0]):
    unique_id = signups.unique_id.values[i]
    start_date = signups.signup_date.values[i]
    end_date = signups.signup_date.values[-1]
    if end_date is start_date:
        end_date = end_date + pd.Timedelta('1 year')
    tmp_df = calls[calls.unique_id == unique_id]
    tmp_df = tmp_df[tmp_df.timestamp > start_date][tmp_df.timestamp < end_date]
    tmp_df = tmp_df.sort_values('timestamp')
    if tmp_df.shape[0] > 0 :
        response_date.append(tmp_df.timestamp.values[0])
    else:
        response_date.append(None)

Thanks for any advice!

Upvotes: 0

Views: 294

Answers (2)

jezrael
jezrael

Reputation: 862901

Another solution with sort_values and aggregate first:

df = df1.merge(df2)
df = df[df.signup_date <= df.call_date]
print (df.sort_values("signup_date", ascending=False)
         .groupby(['call_date', 'email'], as_index=False)
         .first())

   call_date    email signup_date
0 2016-05-04  [email protected]  2016-04-04
1 2016-06-20  [email protected]  2016-06-06
2 2016-07-25  [email protected]  2016-07-21

Upvotes: 1

piRSquared
piRSquared

Reputation: 294358

setup

from StringIO import StringIO
import pandas as pd

txt1 = """email    signup_date    
[email protected]   7/21/16
[email protected]   6/6/16
[email protected]   5/5/16
[email protected]   4/4/16"""

df1 = pd.read_csv(StringIO(txt1), parse_dates=[1], delim_whitespace=True)

txt2 = """email    call_date    
[email protected]   7/25/16
[email protected]   6/20/16
[email protected]   5/4/16"""

df2 = pd.read_csv(StringIO(txt2), parse_dates=[1], delim_whitespace=True)

merge
combine df1 and df2 on email

df = df1.merge(df2)
df

enter image description here

filter 1
get rid of rows where call_date is prior to signup_date

cond1 = df.signup_date.le(df.call_date)
cond1

0     True
1     True
2    False
3     True
4     True
dtype: bool

df = df[cond1]
df

enter image description here

filter 2
groupby ['email', 'call_date'] and get most recent with idxmax

most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
most_recent

call_date   email  
2016-05-04  [email protected]    4
2016-06-20  [email protected]    1
2016-07-25  [email protected]    0
Name: signup_date, dtype: int64

result

df.ix[most_recent]

enter image description here


all together

df = df1.merge(df2)
cond1 = df.signup_date.le(df.call_date)
df = df[cond1]
most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
df.ix[most_recent]

enter image description here

Upvotes: 1

Related Questions