Meher Béjaoui
Meher Béjaoui

Reputation: 372

Pandas : Get top n records based on the sum of every top i record for each group

I have a pandas dataframe like this :

>>> df

    id    value   
0    1      10
1    1      11
2    1      9
3    2      7
4    2      7
5    2      8
6    3      10
7    3      8

I want to get the top two id, based on the sum of their top two values. So here, I should get this :

    id    # value   
0    1    # 11 + 10 = 21
1    3    # 10 + 8 = 18

I tried using :

df.groupby('id')['value'].nlargest(2).sum()

But that returns the sum of all the largest values.

I looked for an answer to my question, but I didn't quite get the proper one.

Upvotes: 2

Views: 7006

Answers (1)

user2285236
user2285236

Reputation:

groupby.nlargest(2) returns a Series with MultiIndex:

df.groupby('id')['value'].nlargest(2)
Out: 
id   
1   1    11
    0    10
2   5     8
    3     7
3   6    10
    7     8
Name: value, dtype: int64

Here, both the id and the original index appear in the returning Series. Now if you take the sum, it will take the sum of every value in this Series. However, if you apply the sum on level=0 (or on the id part of this MultiIndex), it will only take the sum for each id separately.

df.groupby('id')['value'].nlargest(2).sum(level=0)
Out: 
id
1    21
2    15
3    18
Name: value, dtype: int64

Now you have the sum of two largest values for each id. To find the largest two values in this Series you need to call nlargest again:

df.groupby('id')['value'].nlargest(2).sum(level=0).nlargest(2)
Out: 
id
1    21
3    18
Name: value, dtype: int64

Upvotes: 7

Related Questions