Reputation: 424
I'm stuck with pandas. I have df, which contains every transaction (the first column is the index, and the rows are sorted by time):
      email                date
43487   aaa 2017-10-11 08:28:39
42910   bbb 2017-09-24 07:49:52
45561   bbb 2017-12-03 11:03:56
47212   bbb 2018-01-02 12:25:52
89734   ccc 2018-02-02 12:25:52
89734   ccc 2018-03-02 12:20:52
I also have df2, which contains the unique emails with their minimum date (obtained from df1 with drop_duplicates; since it was sorted by time, I got the min date per email by default):
      email                date
43487   aaa 2017-10-11 08:28:39
42910   bbb 2017-09-24 07:49:52
89734   ccc 2018-02-02 12:25:52
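For reference, a minimal runnable sketch of this setup (assuming date is already parsed as datetime and that df and df1 refer to the same transaction frame):
import pandas as pd

# transactions, copied from the table above (df and df1 name the same frame here)
df1 = df = pd.DataFrame(
    {'email': ['aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
     'date': pd.to_datetime(['2017-10-11 08:28:39', '2017-09-24 07:49:52',
                             '2017-12-03 11:03:56', '2018-01-02 12:25:52',
                             '2018-02-02 12:25:52', '2018-03-02 12:20:52'])},
    index=[43487, 42910, 45561, 47212, 89734, 89734])

# keeping the first row per email gives the earliest date,
# because each email's rows are already in time order
df2 = df1.drop_duplicates(subset='email')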
How do I create a column date2 in df2 that contains the second-smallest date for the respective email in df1?
I tried a for loop:
for email in df2['email']:
    # second transaction date for this email, if there is one
    dates = df1.loc[df1['email'] == email, 'date']
    df2.at[email, 'date2'] = dates.iloc[1] if len(dates) > 1 else None
But it is very slow: with 55k rows and 32 GB of RAM, there was no result after 5 minutes.
Desired output is:
      email                date               date2
43487   aaa 2017-10-11 08:28:39                None
42910   bbb 2017-09-24 07:49:52 2017-12-03 11:03:56
89734   ccc 2018-02-02 12:25:52 2018-03-02 12:20:52
Upvotes: 3
Views: 376
Reputation: 88236
You could use sort_values to sort the dates within each email, which can be done by passing a list of columns to sort by. Then GroupBy email and use nth to select the second element (position 1) of each group:
date2 = df.sort_values(['email','date']).groupby('email').nth(1)
                     date
email
bbb   2017-12-03 11:03:56
ccc   2018-03-02 12:20:52
Finally, left merge with df2 on email:
df2.merge(date2, on = 'email', how = 'left')
  email              date_x              date_y
0   aaa 2017-10-11 08:28:39                 NaT
1   bbb 2017-09-24 07:49:52 2017-12-03 11:03:56
2   ccc 2018-02-02 12:25:52 2018-03-02 12:20:52
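If you want the columns named as in the desired output, one option (just a sketch building on the date2 frame above) is to rename its date column to date2 before merging:
df2.merge(date2.rename(columns={'date': 'date2'}), on='email', how='left')
The result then has email, date and date2 columns, with NaT where an email has only one transaction.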
Upvotes: 2