Reputation: 1829
I have a dataframe of election candidates, donors' occupations, and donated (received) amounts. I am trying to find the 7 largest amounts received by each candidate.
The columns are: candidate name = cand_nm, donor's occupation = contbr_occupation, received amount = contb_receipt_amt.
I first grouped the dataframe by candidate name and donor occupation, and added up the donation amounts using .sum():
grouped = df.groupby(['cand_nm','contbr_occupation'])['contb_receipt_amt'].sum()
Then I use nlargest() as below, but it returns the top 7 amounts from the entire series, not from each group. How can I get the top 7 donation amounts within each group?
grouped.nlargest(7)
Another question: the "grouped" variable appears to be an indexed series, but when I print its index using grouped.index, it doesn't return "cand_nm" or "contbr_occupation". Am I wrong to think that this is an indexed series?
Upvotes: 1
Views: 43
Reputation: 862891
You can use SeriesGroupBy.nlargest with group_keys=False to avoid a duplicated level in the MultiIndex:
s1 = grouped.groupby(level=0, group_keys=False).nlargest(7)
Or use Series.sort_values with GroupBy.head:
s1 = grouped.sort_values(ascending=False).groupby(level=0).head(7)
Sample:
import pandas as pd

df = pd.DataFrame({
    'contbr_occupation': list('abcdef'),
    'cand_nm': list('aaabbb'),
    'contb_receipt_amt': [7, 8, 9, 4, 2, 3]
})
grouped = df.groupby(['cand_nm','contbr_occupation'])['contb_receipt_amt'].sum()
s1 = grouped.sort_values(ascending=False).groupby(level=0).head(2)
print (s1)
cand_nm  contbr_occupation
a        c                    9
         b                    8
b        d                    4
         f                    3
Name: contb_receipt_amt, dtype: int64
s1 = grouped.groupby(level=0, group_keys=False).nlargest(2)
print (s1)
cand_nm  contbr_occupation
a        c                    9
         b                    8
b        d                    4
         f                    3
Name: contb_receipt_amt, dtype: int64
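To see what group_keys=False changes, here is a small sketch on the same toy data. It compares the index of the result with and without the option. The variable names with_keys and without_keys are only illustrative.

```python
import pandas as pd

# Same toy data as in the sample (illustrative values only)
df = pd.DataFrame({
    'contbr_occupation': list('abcdef'),
    'cand_nm': list('aaabbb'),
    'contb_receipt_amt': [7, 8, 9, 4, 2, 3],
})
grouped = df.groupby(['cand_nm', 'contbr_occupation'])['contb_receipt_amt'].sum()

# Default (group_keys=True): the group key is prepended to the existing
# index, so the cand_nm level shows up twice in the result.
with_keys = grouped.groupby(level=0).nlargest(2)

# group_keys=False: the result keeps the original two-level index.
without_keys = grouped.groupby(level=0, group_keys=False).nlargest(2)

print(with_keys.index.nlevels, list(with_keys.index.names))
print(without_keys.index.nlevels, list(without_keys.index.names))
```

The values are the same in both results; only the index differs, which is why group_keys=False is usually the more convenient choice here.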
Finally, to get a DataFrame, add Series.reset_index:
df1 = s1.reset_index()
print (df1)
  cand_nm contbr_occupation  contb_receipt_amt
0       a                 c                  9
1       a                 b                  8
2       b                 d                  4
3       b                 f                  3
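On the second part of the question (whether grouped is an indexed series): a minimal sketch, assuming the same toy data, suggests that it is. The index is a MultiIndex whose entries are (cand_nm, contbr_occupation) tuples, so printing grouped.index shows the tuples rather than the column names. The names themselves are stored in grouped.index.names.

```python
import pandas as pd

# Same toy data as in the sample (illustrative values only)
df = pd.DataFrame({
    'contbr_occupation': list('abcdef'),
    'cand_nm': list('aaabbb'),
    'contb_receipt_amt': [7, 8, 9, 4, 2, 3],
})
grouped = df.groupby(['cand_nm', 'contbr_occupation'])['contb_receipt_amt'].sum()

# The grouping columns become the levels of a MultiIndex
print(type(grouped.index).__name__)
print(list(grouped.index.names))

# Select all occupations for one candidate, or a single (candidate, occupation) pair
one_candidate = grouped.loc['a']
one_pair = grouped.loc[('a', 'c')]
print(one_candidate)
print(one_pair)
```

Calling grouped.reset_index() turns both levels back into ordinary columns if a flat DataFrame is easier to work with.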
Upvotes: 1