daydayup
daydayup

Reputation: 2317

pandas nlargest lost one column

I have this dataset:

Id   query  count
001  abc    20
001  bcd    30
001  ccd   100
002  ace   13
002  ahhd   30
002  ahe    28

I want to find the Top2 query for each Id, based on the count. So I want to see:

Id   query  count
001  ccd    100
001  bcd    30
002  ahhd   30
002  ahe    28

I tried these two lines of code:

df.groupby('Id')['count'].nlargest(2), the "query" column is lost in the result, which is not what I wanted. So how to keep query in my result. Id count

001     100
001     30
002     30
002     28

Upvotes: 1

Views: 1044

Answers (3)

piRSquared
piRSquared

Reputation: 294318

I use a groupby and apply the method pd.DataFrame.nlargest. This differs from pd.Series.nlargest in that I have to specify a set of columns to consider when choosing my n rows. This solution keeps the original index values that are attached to the rows, if that is at all important to the OP or end user.

df.groupby('Id', group_keys=False).apply(
    pd.DataFrame.nlargest, n=2, columns='count')

   Id query  count
2   1   ccd    100
1   1   bcd     30
4   2  ahhd     30
5   2   ahe     28

Upvotes: 2

greg_data
greg_data

Reputation: 2293

You could do this with groupby still:

df.sort_values('count', ascending = False).groupby('Id').head(2)

Upvotes: 1

jezrael
jezrael

Reputation: 862741

Use set_index of missing column(s):

df = df.set_index('query').groupby('Id')['count'].nlargest(2).reset_index()
print (df)
    Id query  count
0  001   ccd    100
1  001   bcd     30
2  002  ahhd     30
3  002   ahe     28

Upvotes: 3

Related Questions