Pandas: Group by distinct values of each cell and split column into multiple columns

Question

I want to group-by every distinct cell and splitting the 'Status'-column into multiple columns based on their distinct values. The values of the new column(s) shall have the sum of 'Count' based on the occurrences.

My data:

        Department    Age   Salary  Status  Count
0            Sales  31-35  46K-50K  Senior     30
1            Sales  26-30  26K-30K  Junior     40
2            Sales  31-35  31K-35K  Junior     40
3               IT  21-25  46K-50K  Junior     20
4               IT  31-35  66K-70K  Senior      5
5               IT  26-30  46K-50K  Junior      3
6               IT  41-45  66K-70K  Senior      3
7        Marketing  36-40  46K-50K  Senior     10
8        Marketing  31-35  41K-45K  Junior      4
9   Administration  46-50  36K-40K  Senior      4
10  Administration  26-30  26K-30K  Junior      6

I want it to be:

        Attribute  Junior  Senior
0  Administration       6       4
1              IT      23       8
2       Marketing       4      10
3           Sales      80      30
4           21-25      20       0
5           26-30      49       0
6           31-35      44      35
7           36-40       0      10
8           41-45       0       3
9           46-50       0       4
10        26K-30K      46       0
11        31K-35K      40       0
12        36K-40K       0       4
13        41K-45K       4       0
14        46K-50K      23      40
15        66K-70K       0       8

Code:

import pandas as pd

df = pd.DataFrame({'Department': ['Sales', 'Sales', 'Sales',
                                  'IT', 'IT', 'IT', 'IT',
                                  'Marketing', 'Marketing',
                                 'Administration', 'Administration'],
                   'Age': ['31-35', '26-30', '31-35', '21-25', '31-35', 
                           '26-30', '41-45', '36-40', '31-35', '46-50',
                           '26-30'],
                   'Salary': ['46K-50K', '26K-30K', '31K-35K', '46K-50K',
                              '66K-70K', '46K-50K', '66K-70K', '46K-50K',
                              '41K-45K', '36K-40K', '26K-30K'],
                   'Status': ['Senior', 'Junior', 'Junior', 'Junior',
                              'Senior', 'Junior', 'Senior', 'Senior',
                              'Junior', 'Senior', 'Junior'],
                   'Count': [30, 40, 40, 20, 5, 3, 3, 10, 4, 4, 6]},
                    columns=['Department', 'Age', 'Salary', 'Status', 
                    'Count'])

df2 = df.groupby(['Department']).Count.sum() 
    .append(df.groupby(['Age']).Count.sum()) 
    .append(df.groupby(['Salary']).Count.sum())

This gives me:

Administration     10
IT                 31
Marketing          14
Sales             110
21-25              20
26-30              49
31-35              79
36-40              10
41-45               3
46-50               4
26K-30K            46
31K-35K            40
36K-40K             4
41K-45K             4
46K-50K            63
66K-70K             8

I can't figure out how to do the split on 'Status' and I also think that there's a misuse of append() in here.

John Zwinck · Accepted Answer

You're almost there.

cols = ['Department', 'Age', 'Salary']
parts = [df.groupby([col, 'Status']).Count.sum() for col in cols]
df2 = pd.concat(parts).unstack(fill_value=0)

I used pd.concat() instead of repeated append() because as you pointed out, append() is not very good (it's slow).

Splitting on Status is easy: just add it to groupby() and then unstack() it at the end to turn it into column rather than row labels.

Pandas: Group by distinct values of each cell and split column into multiple columns

Answers (2)

Related Questions