Reputation: 372
I have a pandas dataframe like this :
>>> df
id value
0 1 10
1 1 11
2 1 9
3 2 7
4 2 7
5 2 8
6 3 10
7 3 8
I want to get the top two id, based on the sum of their top two values. So here, I should get this :
id # value
0 1 # 11 + 10 = 21
1 3 # 10 + 8 = 18
I tried using :
df.groupby('id')['value'].nlargest(2).sum()
But that returns the sum of all the largest values.
I looked for an answer to my question, but I didn't quite get the proper one.
Upvotes: 2
Views: 7006
Reputation:
groupby.nlargest(2)
returns a Series with MultiIndex:
df.groupby('id')['value'].nlargest(2)
Out:
id
1 1 11
0 10
2 5 8
3 7
3 6 10
7 8
Name: value, dtype: int64
Here, both the id and the original index appear in the returning Series. Now if you take the sum, it will take the sum of every value in this Series. However, if you apply the sum on level=0 (or on the id part of this MultiIndex), it will only take the sum for each id separately.
df.groupby('id')['value'].nlargest(2).sum(level=0)
Out:
id
1 21
2 15
3 18
Name: value, dtype: int64
Now you have the sum of two largest values for each id. To find the largest two values in this Series you need to call nlargest
again:
df.groupby('id')['value'].nlargest(2).sum(level=0).nlargest(2)
Out:
id
1 21
3 18
Name: value, dtype: int64
Upvotes: 7