Reputation: 21282
A data frame:
df = pd.DataFrame({'First' : ['Mary', 'John', 'Jane', 'Mary', 'Jane', 'Mary', 'Mary'],
'Last' : ['Johnson', 'Smith', 'Doe', 'Johnson', 'Doe', 'Johnson', 'Johnson'],
'Group' : ['A', 'A', 'B', 'A', 'B', 'B', 'B'],
'Measure' : [10, 2, 11, 1, 20, 15, 15]})
First Last Group Measure
0 Mary Johnson A 10
1 John Smith A 2
2 Jane Doe B 11
3 Mary Johnson A 1
4 Jane Doe B 20
5 Mary Johnson B 15
6 Mary Johnson B 15
A person can appear in both groups, there ARE duplicates which are expected and wanted in these data.
I would like to reshape the dataframe by spreading Group variable across columns.
I can do so with pivot_table():
df.pivot_table(index=['First','Last'],
columns='Group',
values='Measure',
fill_value=0).reset_index()
Group First Last A B
0 Jane Doe 0.0 15.5
1 John Smith 2.0 0.0
2 Mary Johnson 5.5 15.0
This has defaulted to using mean for each grouping based on Measure. I would like to specify the aggregation function based on the newly generated columns which came from the original Group variable. In this case, I would like to use Max
on columnA and sum
on column B. Desired output:
First Last A B
0 Mary Johnson 10 30
1 John Smith 2 0
2 Jane Doe 0 31
Example Mary Johnson. For her values in GroupA, the max
is 10. For her values in group B, the sum
is 30.
Tried:
df.pivot_table(index=['First','Last'],
columns='Group',
values='Measure',
fill_value=0,
aggfunc = {'A': max,
'B': sum}).reset_index()
Which resulted in an error message KeyError: 'A'
How can I pivot my data frame and specify the aggregation function based on the new columns after pivot?
Upvotes: 3
Views: 58
Reputation: 164773
You can use a couple of GroupBy
objects, calculating max
or sum
as appropriate. Then concatenate results in a separate step.
A = df[df['Group'] == 'A'].groupby(['First', 'Last'])['Measure'].max().rename('A')
B = df[df['Group'] == 'B'].groupby(['First', 'Last'])['Measure'].sum().rename('B')
res = pd.concat([A, B], axis=1).fillna(0).astype(int).reset_index()
print(res)
First Last A B
0 Jane Doe 0 31
1 John Smith 2 0
2 Mary Johnson 10 30
Upvotes: 0
Reputation: 59274
You can always specify both and just filter
ndf = df.pivot_table(index=['First','Last'],
columns='Group',
values='Measure',
fill_value=0,
aggfunc=['sum', 'max'])
ndf.loc[:, ((ndf.columns.get_level_values(0)=='max') & (ndf.columns.get_level_values(1)=='A') ) | \
((ndf.columns.get_level_values(0)=='sum') & (ndf.columns.get_level_values(1)=='B') )]
sum max
Group B A
First Last
Jane Doe 31 0
John Smith 0 2
Mary Johnson 30 10
Upvotes: 3