cph_sto
cph_sto

Reputation: 7587

Creating a DataFrame by aggregating by each column

I have a DataFrame, where all columns are in binary form, i.e; taking either 0 or 1 as value as shown below.

import pandas as pd
df = pd.DataFrame({'qualified_exam':[1,1,0,0,1,1],'gender_M':[1,0,1,0,1,1],'employed':[1,0,0,1,1,1],'married':[0,1,0,1,1,0]})
print(df)
      qualified_exam  gender_M  employed  married
   0               1         1         1        0
   1               1         0         0        1
   2               0         1         0        0
   3               0         0         1        1
   4               1         1         1        1
   5               1         1         1        0

I want to create a DataFrame where I want to measure sum/mean of column qualified _exam by grouping by all the remaining 3 columns individually - gender_M, employed, married.

Final DataFrame should look something like this -

            sum_0   sum_1    mean_0    mean_1
gender_M        1       3      0.50      0.75
employed        1       3      0.50      0.75
 married        2       2      0.66      0.66

I tried it by doing groupby() and agg() for each of the 3 columns individually, and then appending the result one by one. This is too cumbersome. I am sure there is a better way.

Upvotes: 1

Views: 33

Answers (1)

jezrael
jezrael

Reputation: 862681

Use DataFrame.melt with aggregate sum with mean, reshape by DataFrame.unstack and last flatten MultiIndex in columns:

df1 = (df.melt('qualified_exam')
         .groupby(['variable', 'value'])['qualified_exam']
         .agg(['sum','mean'])
         .unstack()

         )
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df1)
          sum_0  sum_1    mean_0    mean_1
variable                                  
employed      1      3  0.500000  0.750000
gender_M      1      3  0.500000  0.750000
married       2      2  0.666667  0.666667

Or use DataFrame.pivot_table:

df1 = (df.melt('qualified_exam')
         .pivot_table(index='variable', 
                      columns='value', 
                      values='qualified_exam', 
                      aggfunc=('sum','mean')))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df1)
            mean_0    mean_1  sum_0  sum_1
variable                                  
employed  0.500000  0.750000    1.0    3.0
gender_M  0.500000  0.750000    1.0    3.0
married   0.666667  0.666667    2.0    2.0

Upvotes: 2

Related Questions