Reputation: 3582
I have a pandas DataFrame that looks like this:
email signup_date
[email protected] 7/21/16
[email protected] 6/6/16
[email protected] 5/5/16
[email protected] 4/4/16
I have a second pandas DataFrame with related events (when a signup actually got followed through on) that looks like this:
email call_date
[email protected] 7/25/16
[email protected] 6/20/16
[email protected] 5/4/16
There are a few things to keep in mind. A signup_date may not have any later call_date for the corresponding email; that is, some people sign up but never get a call back. Ultimately the goal is to determine whether there is a call event that came after a signup event for a given user but before the next signup event for the same user, where the number of signup events per user is not known in advance.
Is there a pandas best-practices way to do this? For now I'm using a for loop, and it's extremely slow (it hasn't finished on 100,000 rows even after 20 minutes):
response_date = []
for i in range(signups.shape[0]):
    unique_id = signups.unique_id.values[i]
    start_date = signups.signup_date.values[i]
    end_date = signups.signup_date.values[-1]
    if end_date == start_date:
        end_date = end_date + pd.Timedelta(days=365)  # roughly one year past the last signup
    # calls for this user that fall inside the window
    tmp_df = calls[calls.unique_id == unique_id]
    tmp_df = tmp_df[(tmp_df.timestamp > start_date) & (tmp_df.timestamp < end_date)]
    tmp_df = tmp_df.sort_values('timestamp')
    if tmp_df.shape[0] > 0:
        response_date.append(tmp_df.timestamp.values[0])
    else:
        response_date.append(None)
Thanks for any advice!
Upvotes: 0
Views: 294
Reputation: 862901
Another solution with sort_values and aggregate first:
df = df1.merge(df2)
df = df[df.signup_date <= df.call_date]
print (df.sort_values("signup_date", ascending=False)
         .groupby(['call_date', 'email'], as_index=False)
         .first())
    call_date        email signup_date
0  2016-05-04  [email protected]  2016-04-04
1  2016-06-20  [email protected]  2016-06-06
2  2016-07-25  [email protected]  2016-07-21
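The same "latest signup per (call_date, email)" selection can also be done with drop_duplicates, which keeps the first row it sees per key; a minimal sketch (equivalent up to row order), assuming df1 and df2 are the two frames above with the dates already parsed:
df = df1.merge(df2)                          # inner join on the shared 'email' column
df = df[df.signup_date <= df.call_date]      # drop calls that precede the signup
out = (df.sort_values('signup_date', ascending=False)
         .drop_duplicates(['call_date', 'email']))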
Upvotes: 1
Reputation: 294358
setup
from io import StringIO  # Python 3; the old StringIO module was Python 2 only
import pandas as pd
txt1 = """email signup_date
[email protected] 7/21/16
[email protected] 6/6/16
[email protected] 5/5/16
[email protected] 4/4/16"""
df1 = pd.read_csv(StringIO(txt1), parse_dates=[1], delim_whitespace=True)
txt2 = """email call_date
[email protected] 7/25/16
[email protected] 6/20/16
[email protected] 5/4/16"""
df2 = pd.read_csv(StringIO(txt2), parse_dates=[1], delim_whitespace=True)
merge
combine df1 and df2 on email
df = df1.merge(df2)
df
filter 1
get rid of rows where call_date is prior to signup_date
cond1 = df.signup_date.le(df.call_date)
cond1
0     True
1     True
2    False
3     True
4     True
dtype: bool
df = df[cond1]
df
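As a stylistic aside, the same row filter can be written with query instead of a boolean mask; a small sketch assuming the merged df from above:
df = df.query('signup_date <= call_date')   # same rows as df[cond1]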
filter 2
groupby ['call_date', 'email'] and get the most recent signup_date with idxmax
most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
most_recent
call_date email
2016-05-04 [email protected] 4
2016-06-20 [email protected] 1
2016-07-25 [email protected] 0
Name: signup_date, dtype: int64
result
idxmax returned index labels, so .loc pulls exactly those rows from df
df.loc[most_recent]
all together
df = df1.merge(df2)
cond1 = df.signup_date.le(df.call_date)
df = df[cond1]
most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
df.loc[most_recent]
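A further option for the "first call on or after each signup per user" framing in the question is pd.merge_asof. This is a sketch, not benchmarked here, assuming df1 and df2 are the datetime-parsed frames from the setup above; note it does not enforce the "before the next signup" bound on its own:
first_call = pd.merge_asof(
    df1.sort_values('signup_date'),   # left frame must be sorted on its key
    df2.sort_values('call_date'),     # right frame must be sorted on its key
    left_on='signup_date',
    right_on='call_date',
    by='email',                       # match within the same email only
    direction='forward',              # first call at or after the signup
)
Rows with no later call for that email come back with NaT in call_date, which matches the "never called back" case in the question.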
Upvotes: 1