How to remove duplicate entires using the latest time in Pandas

Question

Here is the snippet:

test = pd.DataFrame({'uid':[1,1,2,2,3,3], 
                     'start_time':[datetime(2017,7,20),datetime(2017,6,20),datetime(2017,5,20),datetime(2017,4,20),datetime(2017,3,20),datetime(2017,2,20)],
                     'amount': [10,11,12,13,14,15]})

Output:

      amount start_time uid
     0  10   2017-07-20  1
     1  11   2017-06-20  1
     2  12   2017-05-20  2
     3  13   2017-04-20  2
     4  14   2017-03-20  3
     5  15   2017-02-20  3

Desired Output:

      amount start_time uid
     0  10   2017-07-20  1
     2  12   2017-05-20  2
     4  14   2017-03-20  3

I want to group by uid and mind the row with the latest start_time. Basically, I want to remove duplicate uid by only selecting the uid with the latest start_time.

I tried test.groupby(['uid'])['start_time'].max() but it doesn't work as it only returns back the uid and start_time column. I need the amount column as well.

Update: Thanks to @jezrael & @EdChum, you guys always help me out on this forum, thank you so much！

I tested both solutions in terms of execution time on a dataset of 1136 rows and 30 columns:

Method A: test.sort_values('start_time', ascending=False).drop_duplicates('uid')
Total execution time: 3.21 ms

Method B: test.loc[test.groupby('uid')['start_time'].idxmax()]
Total execution time: 65.1 ms

I guess groupby requires more time to compute.

jezrael · Accepted Answer

Use sort_values by column start_time with drop_duplicates by uid:

df = test.sort_values('start_time', ascending=False).drop_duplicates('uid')
print (df)
   amount start_time  uid
0      10 2017-07-20    1
2      12 2017-05-20    2
4      14 2017-03-20    3

If need output with ordered uid:

print (test.sort_values('start_time', ascending=False)
           .drop_duplicates('uid')
           .sort_values('uid'))

How to remove duplicate entires using the latest time in Pandas

Answers (2)

Related Questions