Reputation: 17904
Here is the snippet:
test = pd.DataFrame({'uid':[1,1,2,2,3,3],
'start_time':[datetime(2017,7,20),datetime(2017,6,20),datetime(2017,5,20),datetime(2017,4,20),datetime(2017,3,20),datetime(2017,2,20)],
'amount': [10,11,12,13,14,15]})
Output:
amount start_time uid
0 10 2017-07-20 1
1 11 2017-06-20 1
2 12 2017-05-20 2
3 13 2017-04-20 2
4 14 2017-03-20 3
5 15 2017-02-20 3
Desired Output:
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
I want to group by uid
and mind the row with the latest start_time. Basically, I want to remove duplicate uid by only selecting the uid with the latest start_time
.
I tried test.groupby(['uid'])['start_time'].max()
but it doesn't work as it only returns back the uid
and start_time
column. I need the amount
column as well.
Update: Thanks to @jezrael & @EdChum, you guys always help me out on this forum, thank you so much!
I tested both solutions in terms of execution time on a dataset of 1136 rows and 30 columns:
Method A: test.sort_values('start_time', ascending=False).drop_duplicates('uid')
Total execution time: 3.21 ms
Method B: test.loc[test.groupby('uid')['start_time'].idxmax()]
Total execution time: 65.1 ms
I guess groupby
requires more time to compute.
Upvotes: 1
Views: 32
Reputation: 862771
Use sort_values
by column start_time
with drop_duplicates
by uid
:
df = test.sort_values('start_time', ascending=False).drop_duplicates('uid')
print (df)
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
If need output with ordered uid
:
print (test.sort_values('start_time', ascending=False)
.drop_duplicates('uid')
.sort_values('uid'))
Upvotes: 0