Reputation: 394
I have the following Pandas Series:
Count
Pclass Survived
1 0 80
1 136
2 0 97
1 87
3 0 372
1 119
But I want something like this:
Count Percentage
Pclass Survived
1 0 80 37.0
1 136 63.0
2 0 97 52.7
1 87 47.3
3 0 372 75.8
1 119 24.2
I want to calculate the percentages based on only the counts in each category of Pclass (not the whole sum of counts). It would be great if these percentages are calculated using only the Count column.
So far, what I did was I summed up the counts according to Pclass and used .repeat() to duplicate the values and tried to concatenate it to the original Series, which failed spectacularly.
Count
Pclass
1 216
1 216
2 184
2 184
3 491
3 491
My idea was to use this repeat column as the denominators in calculating the percentages, like this:
80 / 216 * 100 = 37.0%
and then remove the repeat column after the percentages are calculated. Seems so simple but I can't seem to make it work. Any help is appreciated.
Upvotes: 5
Views: 2259
Reputation: 294488
div
and sum
with level=0
df.assign(Pct=df.div(df.sum(level=0), level=0).round(2) * 100)
Count Pct
Pclass Survived
1 0 80 37.0
1 136 63.0
2 0 97 53.0
1 87 47.0
3 0 372 76.0
1 119 24.0
Upvotes: 6
Reputation: 2668
You are working on titanic dataset. You can also do like below on the original data (not on your Series above):
df.Survived.groupby(df.Pclass).value_counts(normalize=True)*100
It will give you count of each class in percentage rather than in actual number.
Upvotes: 0
Reputation: 863246
Use GroupBy.transform
for return Series
with same size as original DataFrame
, divide by div
, multiple by mul
and if necessary round
:
#if input is Series create one column DataFrame
df = s.to_frame('Count')
s = df.groupby('Pclass')['Count'].transform('sum')
#if use oldier pandas version
#s = df.groupby(level='Pclass')['Count'].transform('sum')
df['Percentage'] = df['Count'].div(s).mul(100).round()
print (df)
Count Percentage
Pclass Survived
1 0 80 37.0
1 136 63.0
2 0 97 53.0
1 87 47.0
3 0 372 76.0
1 119 24.0
EDIT:
dont you have to forward fill on PClass ?
It is not necessary, because default in MultiIndex are not shown, but if set multi_sparse
to False
is possible verify repeating values:
with pd.option_context('display.multi_sparse', False):
print (df)
Count
Pclass Survived
1 0 80
1 1 136
2 0 97
2 1 87
3 0 372
3 1 119
Upvotes: 5