Alberto Bonsanto
Alberto Bonsanto

Reputation: 18022

Get top n values per category in pandas retaining all columns

After some transformations I got the following dataframe, how do I proceed to obtain the top n records by a column in this case short_name and using other as indicator frequency. I read this post but the problem with both solutions is that they get rid of the column product_name, they just retain the grouped column and I need to keep them all.

short_name          product_id    frequency
Yoghurt y cereales  975009684     32
Yoghurt y cereales  975009685     21
Yoghurt y cereales  975009700     16
Yoghurt y Cereales  21097         16
Yoghurt Bebible     21329         68
Yoghurt Bebible     21328         67
Yoghurt Bebible     21500         31

Upvotes: 9

Views: 4714

Answers (5)

guibor
guibor

Reputation: 580

df.groupby('short_name').nlargest(2, 'frequency').reset_index()

Upvotes: 0

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

I'd try to use nlargest method:

In [5]: df.groupby('short_name', as_index=False).apply(lambda x: x.nlargest(2, 'frequency'))
Out[5]:
             short_name  product_id  frequency
0 4     Yoghurt Bebible       21329         68
  5     Yoghurt Bebible       21328         67
1 3  Yoghurt y Cereales       21097         16
2 0  Yoghurt y cereales   975009684         32
  1  Yoghurt y cereales   975009685         21

Upvotes: 10

Fredz0r
Fredz0r

Reputation: 621

If I understand your question correctly using .apply is not necessary in this case. This is something I would avoid unless there's no other option because of performance issues.

Try this:

df.sort_values(by='frequency',ascending=False).groupby('short_name').head(2).reset_index()

Upvotes: 4

EFT
EFT

Reputation: 2369

You could first sort the dataframe, then use groupby:

df.sort_values('frequency', ascending=False).groupby('short_name').head(2)
Out[28]: 
           short_name  product_id  frequency
4     Yoghurt Bebible       21329         68
5     Yoghurt Bebible       21328         67
0  Yoghurt y cereales   975009684         32
1  Yoghurt y cereales   975009685         21
3  Yoghurt y Cereales       21097         16

Upvotes: 8

Scott Boston
Scott Boston

Reputation: 153460

You can try this:

df.groupby('short_name', as_index=False).apply(lambda x: x.sort_values(by='frequency',ascending=False).head(2)).reset_index(drop=True)

Output:

           short_name  product_id  frequency
0     Yoghurt Bebible       21329         68
1     Yoghurt Bebible       21328         67
2  Yoghurt y Cereales       21097         16
3  Yoghurt y cereales   975009684         32
4  Yoghurt y cereales   975009685         21

Upvotes: 5

Related Questions