mimic

Reputation: 5224

Pandas - count percentage of group size

Say I have data like this:

col1   col2 other columns..
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
1      0    ...
1      0    ...
etc...

The data has been grouped by two columns (the output below is already the result of the grouping):

gr = df.groupby(['col1', 'col2']).size()

col1   col2        
0      0           10
       1           5
1      0           2
       1           16
2      0           10

Now I need to work out what fraction of its col1 group each (col1, col2) subgroup count represents.

I need to add one more column, or (better) transform the result into a Series, holding the share of each col2 value within its col1 group, like:

col1   col2
0      0           0.66
       1           0.33
1      0           0.1
       1           0.9
2      0           1

Or it can be a separate Series for each col2 value: [0.66 0.1 1] and [0.33 0.9]. How can I implement this?

To describe the meaning of this data: for example, col1 can be subjects (0, 1, 2), col2 results (0 or 1), and the count the number of students per subject per result. The idea is to figure out what percentage of students failed/passed for subjects 0, 1, etc.

One more thing: sometimes there is only one result (0 or 1), for example a subject that all students passed, and I still need to be able to tell that for this subject the percentage of 0 is 0.0 and of 1 is 1.0.

Upvotes: 4

Views: 3616

Answers (2)

jezrael

Reputation: 862661

You need to group by the first level of the index and take the sum:

gr = df.groupby(['col1', 'col2']).size()
print (gr)
col1  col2
0     0       10
      1        5
1     0        2
      1       16
2     0       10
dtype: int64

print (gr.groupby(level=0).sum())
col1
0    15
1    18
2    10
dtype: int64

print (gr / gr.groupby(level=0).sum())
col1  col2
0     0       0.666667
      1       0.333333
1     0       0.111111
      1       0.888889
2     0       1.000000
dtype: float64
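For anyone who wants to run the steps above end to end, here is a minimal sketch. The sample frame is an assumption: it is built only to reproduce the group sizes shown in the question, and the names `counts`, `rows` and `pct` are illustrative. It uses `transform('sum')` instead of plain `sum()` so the totals keep the same index as `gr` before dividing; the result should match the division shown above.

```python
import pandas as pd

# Hypothetical sample data: one row per observation, built so that the
# group sizes match the counts shown in the question.
counts = {(0, 0): 10, (0, 1): 5, (1, 0): 2, (1, 1): 16, (2, 0): 10}
rows = [(c1, c2) for (c1, c2), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=['col1', 'col2'])

# Size of every (col1, col2) subgroup.
gr = df.groupby(['col1', 'col2']).size()

# Total of each col1 group, broadcast back onto gr's own index,
# then divide each subgroup count by its group total.
pct = gr / gr.groupby(level=0).transform('sum')
print(pct)
```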

To store the Series separately, use a dict comprehension:

g1 = gr / gr.groupby(level=0).sum()
dfs = {i:g.reset_index(drop=True) for i, g in g1.groupby(level=1)}

print (dfs[0])
0    0.666667
1    0.111111
2    1.000000
dtype: float64

print (dfs[1])
0    0.333333
1    0.888889
dtype: float64
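The question also asks for a 0.0 entry when a subject has only one result, which the output above does not show (col1 = 2 has no col2 = 1 row). One possible way to get it, sketched here under the assumption that the counts are the ones from the question, is to pivot col2 into columns with `unstack(fill_value=0)` and normalize each row. The names `wide`, `shares`, `per_result` and `long` are illustrative.

```python
import pandas as pd

# Counts from the question, keyed by (col1, col2).
# col1 == 2 has no col2 == 1 entry.
gr = pd.Series(
    [10, 5, 2, 16, 10],
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)], names=['col1', 'col2']
    ),
)

# Pivot col2 into columns; combinations that never occurred get a count of 0.
wide = gr.unstack(fill_value=0)

# Divide every row by its row total to get the share of each result per col1.
shares = wide.div(wide.sum(axis=1), axis=0)
print(shares)

# One Series per col2 value, indexed by col1 (missing results show up as 0.0).
per_result = {c: shares[c] for c in shares.columns}

# Back to a long Series indexed by (col1, col2), now including (2, 1).
long = shares.stack()
```

Unlike the dict above, every Series in `per_result` has one entry per col1 value, so the positions line up across results.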

Upvotes: 9

Andreas Hsieh

Reputation: 2150

You might be able to try this:

import pandas as pd

df = pd.DataFrame({'A':[0,1,0,1,0],'B':[10,5,2,16,10]}, index=[0,1,0,1,0])
# .ix was removed in pandas 1.0; .loc does the same label-based selection
df2 = df.loc[0] / df.loc[0].sum()
df3 = df.loc[1] / df.loc[1].sum()
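The snippet above handles one index label at a time. As a rough sketch of the same idea for any number of labels, assuming the counts live in column `B` and the group label is the index as in this answer's example frame, `groupby(level=0)` with `transform('sum')` gives the per-label total on every row. The `share` column name is made up for illustration.

```python
import pandas as pd

# Same example frame as in this answer: the index is the group label,
# B holds the counts.
df = pd.DataFrame({'A': [0, 1, 0, 1, 0], 'B': [10, 5, 2, 16, 10]},
                  index=[0, 1, 0, 1, 0])

# Total of B for each index label, repeated on every row of that label.
totals = df.groupby(level=0)['B'].transform('sum')

# Share of each row's count within its label.
df['share'] = df['B'] / totals
print(df)
```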

Hope this will help.

Upvotes: 1
