Reputation: 9213
I have a pandas dataframe. The dataframe has 4 columns. The last one is just some random data. The first two columns are columns I will group by and sum the value column. Of each grouping, I would only like to keep the first row (i.e. the group with the largest sum).
My data:
import pandas as pd
df = pd.DataFrame(data=[['0', 'A', 3, 'a'],
['0', 'A', 2, 'b'],
['0', 'A', 1, 'c'],
['0', 'B', 3, 'd'],
['0', 'B', 4, 'e'],
['0', 'B', 4, 'f'],
['1', 'C', 3, 'g'],
['1', 'C', 2, 'h'],
['1', 'C', 1, 'i'],
['1', 'D', 3, 'j'],
['1', 'D', 4, 'k'],
['1', 'D', 4, 'l']
], columns=['group col 1', 'group col 2', 'value', 'random data']
)
Desired output:
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
I have an inefficient way of getting there, but looking for a simpler solution.
My solution:
df1 = df.groupby(['group col 1','group col 2']).agg('sum').reset_index()
biggest_groups= df1.sort_values(by=['group col 1', 'value'], ascending=[True, False])
biggest_groups = biggest_groups.groupby('group col 1').head(1)
pairs = biggest_groups[['group col 1', 'group col 2']].values.tolist()
pairs = [tuple(i) for i in pairs]
df = df[df[['group col 1', 'group col 2']].apply(tuple, axis = 1).isin(pairs)]
Upvotes: 0
Views: 187
Reputation: 323276
IIUC you will need two groupby
here, one is to get the sum
, then we base on the group select the max
again
s=df.groupby(['group col 1', 'group col 2']).value.transform('sum')
s=df[s.groupby(df['group col 1']).transform('max')==s]
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
Upvotes: 1