Statistics for Grouped DataFrames with Pandas

Question

I have a DataFrame that can be grouped basically by two columns: Level and Sub_level.

The data looks like this:

    Level_1    Sub_level   Value

0    Group A   A1          100
1    Group A   A2          200
2    Group A   A1          150
3    Group B   B1          100
4    Group B   B2          200
5    Group A   A1          200
6    Group A   A1          300
7    Group A   A1          400
8    Group B   B2          450
...

I would like to get the frequency/count in each Sub_level compared to each comparable Level_1, i.e

Level_1   Sub_level   Pct_of_total

Group A   A1          5 / 6  (as there are 6 Group A instances in 'Level_1', and 5 A1:s in 'Sub_level')
          A2          1 / 6 
Group B   B1          1 / 3  (as there are 3 Group B instances in 'Level_1', and 1 B1:s in 'Sub_level')
          B2          2 / 3

Of course the fractions in the new column Pct_of_total should be expressed in percentage.

Any clues?

Thanks,

/N

jezrael · Accepted Answer

I think you need groupby + size for first df and then groupby by first level (Level_1) and transform sum. Last divide by div:

df1 = df.groupby(['Level_1','Sub_level'])['Value'].size()
print (df1)
Level_1  Sub_level
Group A  A1           5
         A2           1
Group B  B1           1
         B2           2
Name: Value, dtype: int64

df2 = df1.groupby(level=0).transform('sum')
print (df2)
Level_1  Sub_level
Group A  A1           6
         A2           6
Group B  B1           3
         B2           3
Name: Value, dtype: int64

df3 = df1.div(df2).reset_index(name='Pct_of_total')
print (df3)
   Level_1 Sub_level  Pct_of_total
0  Group A        A1      0.833333
1  Group A        A2      0.166667
2  Group B        B1      0.333333
3  Group B        B2      0.666667

Statistics for Grouped DataFrames with Pandas

Answers (1)

Related Questions