Doug Fir
Doug Fir

Reputation: 21282

Choose different aggregation functions during pivot

A data frame:

df = pd.DataFrame({'First' : ['Mary', 'John', 'Jane', 'Mary', 'Jane', 'Mary', 'Mary'], 
                   'Last' : ['Johnson', 'Smith', 'Doe', 'Johnson', 'Doe', 'Johnson', 'Johnson'], 
                   'Group' : ['A', 'A', 'B', 'A', 'B', 'B', 'B'], 
                   'Measure' : [10, 2, 11, 1, 20, 15, 15]})

  First     Last Group  Measure
0  Mary  Johnson     A       10
1  John    Smith     A        2
2  Jane      Doe     B       11
3  Mary  Johnson     A        1
4  Jane      Doe     B       20
5  Mary  Johnson     B       15
6  Mary  Johnson     B       15

A person can appear in both groups, there ARE duplicates which are expected and wanted in these data.

I would like to reshape the dataframe by spreading Group variable across columns.

I can do so with pivot_table():

df.pivot_table(index=['First','Last'],
               columns='Group',
               values='Measure',
               fill_value=0).reset_index()

Group First     Last    A     B
0      Jane      Doe  0.0  15.5
1      John    Smith  2.0   0.0
2      Mary  Johnson  5.5  15.0

This has defaulted to using mean for each grouping based on Measure. I would like to specify the aggregation function based on the newly generated columns which came from the original Group variable. In this case, I would like to use Max on columnA and sum on column B. Desired output:

  First     Last   A   B
0  Mary  Johnson  10  30
1  John    Smith   2   0
2  Jane      Doe   0  31

Example Mary Johnson. For her values in GroupA, the max is 10. For her values in group B, the sum is 30.

Tried:

df.pivot_table(index=['First','Last'],
               columns='Group',
               values='Measure',
               fill_value=0,
               aggfunc = {'A': max,
                          'B': sum}).reset_index()

Which resulted in an error message KeyError: 'A'

How can I pivot my data frame and specify the aggregation function based on the new columns after pivot?

Upvotes: 3

Views: 58

Answers (2)

jpp
jpp

Reputation: 164773

You can use a couple of GroupBy objects, calculating max or sum as appropriate. Then concatenate results in a separate step.

A = df[df['Group'] == 'A'].groupby(['First', 'Last'])['Measure'].max().rename('A')
B = df[df['Group'] == 'B'].groupby(['First', 'Last'])['Measure'].sum().rename('B')

res = pd.concat([A, B], axis=1).fillna(0).astype(int).reset_index()

print(res)

  First     Last   A   B
0  Jane      Doe   0  31
1  John    Smith   2   0
2  Mary  Johnson  10  30

Upvotes: 0

rafaelc
rafaelc

Reputation: 59274

You can always specify both and just filter

ndf = df.pivot_table(index=['First','Last'],
               columns='Group',
               values='Measure',
               fill_value=0,
               aggfunc=['sum', 'max'])

ndf.loc[:, ((ndf.columns.get_level_values(0)=='max') & (ndf.columns.get_level_values(1)=='A') ) | \
           ((ndf.columns.get_level_values(0)=='sum') & (ndf.columns.get_level_values(1)=='B') )]

                  sum   max
        Group     B     A
First   Last        
Jane    Doe       31    0
John    Smith     0     2
Mary    Johnson   30    10

Upvotes: 3

Related Questions