Cheng
Cheng

Reputation: 17904

How to remove duplicate entires using the latest time in Pandas

Here is the snippet:

test = pd.DataFrame({'uid':[1,1,2,2,3,3], 
                     'start_time':[datetime(2017,7,20),datetime(2017,6,20),datetime(2017,5,20),datetime(2017,4,20),datetime(2017,3,20),datetime(2017,2,20)],
                     'amount': [10,11,12,13,14,15]})

Output:

      amount start_time uid
     0  10   2017-07-20  1
     1  11   2017-06-20  1
     2  12   2017-05-20  2
     3  13   2017-04-20  2
     4  14   2017-03-20  3
     5  15   2017-02-20  3

Desired Output:

      amount start_time uid
     0  10   2017-07-20  1
     2  12   2017-05-20  2
     4  14   2017-03-20  3

I want to group by uid and mind the row with the latest start_time. Basically, I want to remove duplicate uid by only selecting the uid with the latest start_time.

I tried test.groupby(['uid'])['start_time'].max() but it doesn't work as it only returns back the uid and start_time column. I need the amount column as well.


Update: Thanks to @jezrael & @EdChum, you guys always help me out on this forum, thank you so much!

I tested both solutions in terms of execution time on a dataset of 1136 rows and 30 columns:

Method A: test.sort_values('start_time', ascending=False).drop_duplicates('uid')
Total execution time: 3.21 ms

Method B: test.loc[test.groupby('uid')['start_time'].idxmax()]
Total execution time: 65.1 ms

I guess groupby requires more time to compute.

Upvotes: 1

Views: 32

Answers (2)

jezrael
jezrael

Reputation: 862771

Use sort_values by column start_time with drop_duplicates by uid:

df = test.sort_values('start_time', ascending=False).drop_duplicates('uid')
print (df)
   amount start_time  uid
0      10 2017-07-20    1
2      12 2017-05-20    2
4      14 2017-03-20    3

If need output with ordered uid:

print (test.sort_values('start_time', ascending=False)
           .drop_duplicates('uid')
           .sort_values('uid'))

Upvotes: 0

EdChum
EdChum

Reputation: 394101

Use idxmax to return the index of the latest time and use this to index the original df:

In[35]:
test.loc[test.groupby('uid')['start_time'].idxmax()]

Out[35]: 
   amount start_time  uid
0      10 2017-07-20    1
2      12 2017-05-20    2
4      14 2017-03-20    3

Upvotes: 1

Related Questions