Reputation: 18022
After some transformations I got the following dataframe
, how do I proceed to obtain the top n records by a column in this case short_name
and using other as indicator frequency
. I read this post but the problem with both solutions is that they get rid of the column product_name
, they just retain the grouped column and I need to keep them all.
short_name product_id frequency
Yoghurt y cereales 975009684 32
Yoghurt y cereales 975009685 21
Yoghurt y cereales 975009700 16
Yoghurt y Cereales 21097 16
Yoghurt Bebible 21329 68
Yoghurt Bebible 21328 67
Yoghurt Bebible 21500 31
Upvotes: 9
Views: 4714
Reputation: 210832
I'd try to use nlargest
method:
In [5]: df.groupby('short_name', as_index=False).apply(lambda x: x.nlargest(2, 'frequency'))
Out[5]:
short_name product_id frequency
0 4 Yoghurt Bebible 21329 68
5 Yoghurt Bebible 21328 67
1 3 Yoghurt y Cereales 21097 16
2 0 Yoghurt y cereales 975009684 32
1 Yoghurt y cereales 975009685 21
Upvotes: 10
Reputation: 621
If I understand your question correctly using .apply is not necessary in this case. This is something I would avoid unless there's no other option because of performance issues.
Try this:
df.sort_values(by='frequency',ascending=False).groupby('short_name').head(2).reset_index()
Upvotes: 4
Reputation: 2369
You could first sort the dataframe, then use groupby
:
df.sort_values('frequency', ascending=False).groupby('short_name').head(2)
Out[28]:
short_name product_id frequency
4 Yoghurt Bebible 21329 68
5 Yoghurt Bebible 21328 67
0 Yoghurt y cereales 975009684 32
1 Yoghurt y cereales 975009685 21
3 Yoghurt y Cereales 21097 16
Upvotes: 8
Reputation: 153460
You can try this:
df.groupby('short_name', as_index=False).apply(lambda x: x.sort_values(by='frequency',ascending=False).head(2)).reset_index(drop=True)
Output:
short_name product_id frequency
0 Yoghurt Bebible 21329 68
1 Yoghurt Bebible 21328 67
2 Yoghurt y Cereales 21097 16
3 Yoghurt y cereales 975009684 32
4 Yoghurt y cereales 975009685 21
Upvotes: 5