Reputation: 424
I'm stuck with pandas. I have df, which contains every transaction (the first column is the index, and the rows are sorted by time):
      email                date
43487   aaa 2017-10-11 08:28:39
42910   bbb 2017-09-24 07:49:52
45561   bbb 2017-12-03 11:03:56
47212   bbb 2018-01-02 12:25:52
89734   ccc 2018-02-02 12:25:52
89734   ccc 2018-03-02 12:20:52
I also have df2, which contains the unique emails with their minimum date (obtained from df1 with drop_duplicates; since it was sorted by time, I got the min date per email by default):
      email                date
43487   aaa 2017-10-11 08:28:39
42910   bbb 2017-09-24 07:49:52
89734   ccc 2018-02-02 12:25:52
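For reference, a minimal runnable sketch of this setup (assuming date is already parsed as datetime and that df and df1 refer to the same transaction frame):
import pandas as pd

# transactions, copied from the table above (df and df1 name the same frame here)
df1 = df = pd.DataFrame(
    {'email': ['aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
     'date': pd.to_datetime(['2017-10-11 08:28:39', '2017-09-24 07:49:52',
                             '2017-12-03 11:03:56', '2018-01-02 12:25:52',
                             '2018-02-02 12:25:52', '2018-03-02 12:20:52'])},
    index=[43487, 42910, 45561, 47212, 89734, 89734])

# keeping the first row per email gives the earliest date,
# because each email's rows are already in time order
df2 = df1.drop_duplicates(subset='email')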
How do I create a column date2 in df2 that contains the second-smallest date for the respective email in df1?
I tried a for loop:
for email in df2['email']:
    # second transaction date for this email, if there is one
    dates = df1.loc[df1['email'] == email, 'date']
    df2.at[email, 'date2'] = dates.iloc[1] if len(dates) > 1 else None
But it is very slow: with 55k rows and 32 GB of RAM, there was no result after 5 minutes.
Desired output is:
      email                date               date2
43487   aaa 2017-10-11 08:28:39                None
42910   bbb 2017-09-24 07:49:52 2017-12-03 11:03:56
89734   ccc 2018-02-02 12:25:52 2018-03-02 12:20:52
Upvotes: 3
Views: 376
Reputation: 88236
You could use sort_values to sort the dates within each email, which can be done by passing a list of columns to sort by. Then GroupBy email and use nth to select the second element (position 1) of each group:
date2 = df.sort_values(['email','date']).groupby('email').nth(1)
                     date
email
bbb   2017-12-03 11:03:56
ccc   2018-03-02 12:20:52
Finally, left merge with df2 on email:
df2.merge(date2, on = 'email', how = 'left')
  email              date_x              date_y
0   aaa 2017-10-11 08:28:39                 NaT
1   bbb 2017-09-24 07:49:52 2017-12-03 11:03:56
2   ccc 2018-02-02 12:25:52 2018-03-02 12:20:52
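If you want the columns named as in the desired output, one option (just a sketch building on the date2 frame above) is to rename its date column to date2 before merging:
df2.merge(date2.rename(columns={'date': 'date2'}), on='email', how='left')
The result then has email, date and date2 columns, with NaT where an email has only one transaction.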
Upvotes: 2