Reputation: 649
I want to group-by every distinct cell and splitting the 'Status'-column into multiple columns based on their distinct values. The values of the new column(s) shall have the sum of 'Count' based on the occurrences.
My data:
Department Age Salary Status Count
0 Sales 31-35 46K-50K Senior 30
1 Sales 26-30 26K-30K Junior 40
2 Sales 31-35 31K-35K Junior 40
3 IT 21-25 46K-50K Junior 20
4 IT 31-35 66K-70K Senior 5
5 IT 26-30 46K-50K Junior 3
6 IT 41-45 66K-70K Senior 3
7 Marketing 36-40 46K-50K Senior 10
8 Marketing 31-35 41K-45K Junior 4
9 Administration 46-50 36K-40K Senior 4
10 Administration 26-30 26K-30K Junior 6
I want it to be:
Attribute Junior Senior
0 Administration 6 4
1 IT 23 8
2 Marketing 4 10
3 Sales 80 30
4 21-25 20 0
5 26-30 49 0
6 31-35 44 35
7 36-40 0 10
8 41-45 0 3
9 46-50 0 4
10 26K-30K 46 0
11 31K-35K 40 0
12 36K-40K 0 4
13 41K-45K 4 0
14 46K-50K 23 40
15 66K-70K 0 8
Code:
import pandas as pd
df = pd.DataFrame({'Department': ['Sales', 'Sales', 'Sales',
'IT', 'IT', 'IT', 'IT',
'Marketing', 'Marketing',
'Administration', 'Administration'],
'Age': ['31-35', '26-30', '31-35', '21-25', '31-35',
'26-30', '41-45', '36-40', '31-35', '46-50',
'26-30'],
'Salary': ['46K-50K', '26K-30K', '31K-35K', '46K-50K',
'66K-70K', '46K-50K', '66K-70K', '46K-50K',
'41K-45K', '36K-40K', '26K-30K'],
'Status': ['Senior', 'Junior', 'Junior', 'Junior',
'Senior', 'Junior', 'Senior', 'Senior',
'Junior', 'Senior', 'Junior'],
'Count': [30, 40, 40, 20, 5, 3, 3, 10, 4, 4, 6]},
columns=['Department', 'Age', 'Salary', 'Status',
'Count'])
df2 = df.groupby(['Department']).Count.sum()
.append(df.groupby(['Age']).Count.sum())
.append(df.groupby(['Salary']).Count.sum())
This gives me:
Administration 10
IT 31
Marketing 14
Sales 110
21-25 20
26-30 49
31-35 79
36-40 10
41-45 3
46-50 4
26K-30K 46
31K-35K 40
36K-40K 4
41K-45K 4
46K-50K 63
66K-70K 8
I can't figure out how to do the split on 'Status' and I also think that there's a misuse of append()
in here.
Upvotes: 0
Views: 2149
Reputation: 9081
Use pd.pivot_table()
-
One-Liner
df_pivot = pd.pivot_table(df, columns=['Status'], index=['Department'], aggfunc=np.sum).fillna(0).append(pd.pivot_table(df, columns=['Status'], index=['Salary'],aggfunc=np.sum).fillna(0))
Output
Count
Status Junior Senior
Administration 6.0 4.0
IT 23.0 8.0
Marketing 4.0 10.0
Sales 80.0 30.0
26K-30K 46.0 0.0
31K-35K 40.0 0.0
36K-40K 0.0 4.0
41K-45K 4.0 0.0
46K-50K 23.0 40.0
66K-70K 0.0 8.0
Explanation
Pandas pivot table is a convenience function to output pivot tables like excel. So your output is a combination of generating a pivot on the Department
columns then the Salary
columns.
So you can start with generating separate pivot tables with index
and columns
specified and then use append
to concatenate them.
df1=pd.pivot_table(df, columns=['Status'], index=['Department'], aggfunc=np.sum).fillna(0)
Then the Salary
part -
df2=pd.pivot_table(df, columns=['Status'], index=['Salary'],aggfunc=np.sum).fillna(0)
Finally -
df3 = df1.append(df2)
Upvotes: 1
Reputation: 249394
You're almost there.
cols = ['Department', 'Age', 'Salary']
parts = [df.groupby([col, 'Status']).Count.sum() for col in cols]
df2 = pd.concat(parts).unstack(fill_value=0)
I used pd.concat()
instead of repeated append()
because as you pointed out, append()
is not very good (it's slow).
Splitting on Status is easy: just add it to groupby()
and then unstack()
it at the end to turn it into column rather than row labels.
Upvotes: 2