user2242044
user2242044

Reputation: 9213

Keep largest value based on sum of two groupbys in pandas

I have a pandas dataframe. The dataframe has 4 columns. The last one is just some random data. The first two columns are columns I will group by and sum the value column. Of each grouping, I would only like to keep the first row (i.e. the group with the largest sum).

My data:

import pandas as pd

df = pd.DataFrame(data=[['0', 'A', 3, 'a'],
                        ['0', 'A', 2, 'b'],
                        ['0', 'A', 1, 'c'],
                        ['0', 'B', 3, 'd'],
                        ['0', 'B', 4, 'e'],
                        ['0', 'B', 4, 'f'],
                        ['1', 'C', 3, 'g'],
                        ['1', 'C', 2, 'h'],
                        ['1', 'C', 1, 'i'],
                        ['1', 'D', 3, 'j'],
                        ['1', 'D', 4, 'k'],
                        ['1', 'D', 4, 'l']
                       ], columns=['group col 1', 'group col 2', 'value', 'random data']
                        )

Desired output:

       group col 1 group col 2  value random data
3            0           B      3           d
4            0           B      4           e
5            0           B      4           f
9            1           D      3           j
10           1           D      4           k
11           1           D      4           l

I have an inefficient way of getting there, but looking for a simpler solution.

My solution:

df1 = df.groupby(['group col 1','group col 2']).agg('sum').reset_index()
biggest_groups= df1.sort_values(by=['group col 1', 'value'], ascending=[True, False])
biggest_groups = biggest_groups.groupby('group col 1').head(1)
pairs = biggest_groups[['group col 1', 'group col 2']].values.tolist()
pairs = [tuple(i) for i in pairs]
df = df[df[['group col 1', 'group col 2']].apply(tuple, axis = 1).isin(pairs)]

Upvotes: 0

Views: 187

Answers (1)

BENY
BENY

Reputation: 323276

IIUC you will need two groupby here, one is to get the sum , then we base on the group select the max again

s=df.groupby(['group col 1', 'group col 2']).value.transform('sum')
s=df[s.groupby(df['group col 1']).transform('max')==s]
   group col 1 group col 2  value random data
3            0           B      3           d
4            0           B      4           e
5            0           B      4           f
9            1           D      3           j
10           1           D      4           k
11           1           D      4           l

Upvotes: 1

Related Questions