Reputation: 4148
I have a table in the below format. I would like to do the value counts of both the columns (year and operation) and get their percentage.
For example, in year 2014
, value yes
appears 2
out of 3
times, hence 2/3 = 0.66
. I tried with value_counts
but it did not yield the below results.
Any leads would be appreciated.
df[['year', 'operation']].apply(pd.Series.value_counts)
year operation
2014 yes
2014 yes
2014 no
2015
2015 yes
2015 yes
Result:
2014 yes 0.66
2014 no 0.33
2015 0.33
2015 yes 0.66
Upvotes: 2
Views: 5842
Reputation: 35636
Let's try with SeriesGroupBy.value_counts
and set normalize=True
to get the values as a percentage:
out = df.groupby('year')['operation'].value_counts(normalize=True)
out
:
year operation
2014 yes 0.666667
no 0.333333
2015 yes 0.666667
0.333333
Name: operation, dtype: float64
Can also set sort=False
to not sort with highest value per level 0:
out = df.groupby('year')['operation'].value_counts(normalize=True, sort=False)
out
:
year operation
2014 no 0.333333
yes 0.666667
2015 0.333333
yes 0.666667
Name: operation, dtype: float64
Series.reset_index
can be used with name=
set to create a DataFrame instead of a Series and give a name to the unnamed values column:
new_df = (
df.groupby('year')['operation'].value_counts(normalize=True)
.reset_index(name='freq')
)
year operation freq
0 2014 yes 0.666667
1 2014 no 0.333333
2 2015 yes 0.666667
3 2015 0.333333
DataFrame Used:
df = pd.DataFrame({'year': [2014, 2014, 2014, 2015, 2015, 2015],
'operation': ['yes', 'yes', 'no', '', 'yes', 'yes']})
Upvotes: 6