Reputation: 2229
Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?
import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
                   'B' : ['A', 'B', 'C'] * 8,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D' : np.random.randn(24),
                   'E' : np.random.randn(24)})
pd.crosstab(df.A,df.B)
B      A  B  C
A
one    4  4  4
three  2  2  2
two    2  2  2
Expected output:
B        A    B    C
A
one    .33  .33  .33
three  .33  .33  .33
two    .33  .33  .33
Upvotes: 84
Views: 112233
Reputation: 31
Normalizing over the index does the job: pass normalize="index" to pd.crosstab().
Upvotes: 3
Reputation: 10564
We can show it as percentages by multiplying by 100:
pd.crosstab(df.A,df.B, normalize='index')\
.round(4)*100
B          A      B      C
A
one    33.33  33.33  33.33
three  33.33  33.33  33.33
two    33.33  33.33  33.33
Where I've rounded for convenience.
Upvotes: 19
Reputation: 3412
From Pandas 0.18.1 onwards, there's a normalize option:
In [1]: pd.crosstab(df.A,df.B, normalize='index')
Out[1]:
B             A         B         C
A
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333
You can normalise across either all, index (rows), or columns.
More details are available in the documentation.
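To make the three modes concrete, here is a minimal sketch using the A and B columns from the question. The variable names (by_row, by_col, overall) are mine, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Same A and B columns as in the question; C, D and E are not needed here.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

by_row = pd.crosstab(df.A, df.B, normalize='index')    # each row sums to 1
by_col = pd.crosstab(df.A, df.B, normalize='columns')  # each column sums to 1
overall = pd.crosstab(df.A, df.B, normalize='all')     # the whole table sums to 1

print(by_row.sum(axis=1))    # row totals
print(by_col.sum(axis=0))    # column totals
print(overall.values.sum())  # grand total
```

Which mode you want depends on the denominator you have in mind: the row total, the column total, or the total number of observations.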
Upvotes: 111
Reputation: 1780
If you're looking for a percentage of the total, you can divide by the len of the df instead of the row sum:
pd.crosstab(df.A, df.B).apply(lambda r: r/len(df), axis=1)
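As a rough check, on the question's data this should agree with normalize='all'. The sketch below assumes neither A nor B contains missing values; if they did, len(df) would still count those rows while crosstab would leave them out, and the two results would differ. The names counts, share_of_total and built_in are illustrative only.

```python
import numpy as np
import pandas as pd

# Same A and B columns as in the question (no missing values).
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

counts = pd.crosstab(df.A, df.B)

# Divide every cell by the number of rows in the frame.
share_of_total = counts / len(df)

# The built-in equivalent for data without missing values.
built_in = pd.crosstab(df.A, df.B, normalize='all')

print(np.allclose(share_of_total, built_in))
```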
Upvotes: 3
Reputation: 375535
Another option is to use div rather than apply:
In [11]: res = pd.crosstab(df.A, df.B)
Divide by the sum over the index:
In [12]: res.sum(axis=1)
Out[12]:
A
one      12
three     6
two       6
dtype: int64
Similar to above, you need to do something about integer division (I use astype('float')):
In [13]: res.astype('float').div(res.sum(axis=1), axis=0)
Out[13]:
B             A         B         C
A
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333
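For anyone on Python 3 with a recent pandas, here is a small sketch of the same div approach as a script. It assumes div performs true division there, so the astype('float') step is left out; the comparison against normalize='index' is only a sanity check, and the names counts, row_share and built_in are illustrative.

```python
import numpy as np
import pandas as pd

# Same A and B columns as in the question.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

counts = pd.crosstab(df.A, df.B)

# axis=0 aligns the row totals with the index, so each row is divided
# by its own sum.
row_share = counts.div(counts.sum(axis=1), axis=0)

# Sanity check against the built-in row normalisation.
built_in = pd.crosstab(df.A, df.B, normalize='index')
print(np.allclose(row_share, built_in))
```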
Upvotes: 2
Reputation: 251398
pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)
Basically you just have the function that does row/row.sum(), and you use apply with axis=1 to apply it by row.
(If doing this in Python 2, you should use from __future__ import division to make sure division always returns a float.)
Upvotes: 81