Brian Keegan

Reputation: 2229

How to make a pandas crosstab with percentages?

Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?

import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
                   'B' : ['A', 'B', 'C'] * 8,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D' : np.random.randn(24),
                   'E' : np.random.randn(24)})


pd.crosstab(df.A,df.B)


B       A    B    C
A               
one     4    4    4
three   2    2    2
two     2    2    2

Expected output:

B         A    B    C
A               
one     .33  .33  .33
three   .33  .33  .33
two     .33  .33  .33

Upvotes: 84

Views: 112233

Answers (6)

Shivam Aranya

Reputation: 31

Normalizing over the index does this directly. Pass normalize="index" to pd.crosstab() to get each cell as a fraction of its row total.

Upvotes: 3

gabra

Reputation: 10564

We can show it as percentages by multiplying by 100:

(pd.crosstab(df.A, df.B, normalize='index') * 100).round(2)

B          A      B      C
A                         
one    33.33  33.33  33.33
three  33.33  33.33  33.33
two    33.33  33.33  33.33

Where I've rounded for convenience.

Upvotes: 19

Harry

Reputation: 3412

From Pandas 0.18.1 onwards, there's a normalize option:

In [1]: pd.crosstab(df.A,df.B, normalize='index')
Out[1]:

B              A           B           C
A           
one     0.333333    0.333333    0.333333
three   0.333333    0.333333    0.333333
two     0.333333    0.333333    0.333333

You can normalise over all (the grand total), index (each row), or columns (each column).

More details are available in the documentation.
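To make the three modes concrete, here is a small sketch on the question's data. Columns C, D and E are left out because the crosstab does not use them, and the variable names (by_row, by_col, by_all) are my own:

```python
import pandas as pd

# Same two categorical columns as in the question.
df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

# Each cell as a fraction of its row total (every row sums to 1).
by_row = pd.crosstab(df["A"], df["B"], normalize="index")

# Each cell as a fraction of its column total (every column sums to 1).
by_col = pd.crosstab(df["A"], df["B"], normalize="columns")

# Each cell as a fraction of the grand total (the whole table sums to 1).
by_all = pd.crosstab(df["A"], df["B"], normalize="all")

print(by_row, by_col, by_all, sep="\n\n")
```

On this data the ("one", "A") cell comes out as 4/12 by row, 4/8 by column, and 4/24 overall.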

Upvotes: 111

If you're looking for a percentage of the total, divide by len(df) instead of the row sum:

pd.crosstab(df.A, df.B).apply(lambda r: r/len(df), axis=1)
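As a rough sketch of the same idea without apply, the counts can be divided by len(df) in one step. On the question's data this gives the same table as normalize="all"; the variable names below are mine. I would only rely on that equivalence when A and B have no missing values, since len(df) counts every row of the frame while the table only counts rows it could classify:

```python
import pandas as pd

# Same two categorical columns as in the question.
df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

counts = pd.crosstab(df["A"], df["B"])

# Divide every cell by the number of rows in df (24 here).
share_of_total = counts / len(df)

# Built-in route: normalise over the grand total of the table.
share_builtin = pd.crosstab(df["A"], df["B"], normalize="all")

print(share_of_total)
```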

Upvotes: 3

Andy Hayden

Reputation: 375535

Another option is to use div rather than apply:

In [11]: res = pd.crosstab(df.A, df.B)

Divide by the sum over the index:

In [12]: res.sum(axis=1)
Out[12]: 
A
one      12
three     6
two       6
dtype: int64

As with the apply approach, integer division can be a concern on Python 2, so I cast with astype('float') first:

In [13]: res.astype('float').div(res.sum(axis=1), axis=0)
Out[13]: 
B             A         B         C
A                                  
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333
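For reference, a minimal sketch of the div approach on Python 3 with a current pandas, where dividing integer counts already returns floats, so the cast is left out here. It also shows the column-wise variant; the names row_shares and col_shares are mine:

```python
import pandas as pd

# Same two categorical columns as in the question.
df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

res = pd.crosstab(df["A"], df["B"])

# Row shares: divide each row by its row total, aligning on the index.
row_shares = res.div(res.sum(axis=1), axis=0)

# Column shares: divide each column by its column total, aligning on the columns.
col_shares = res.div(res.sum(axis=0), axis=1)

print(row_shares)
print(col_shares)
```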

Upvotes: 2

BrenBarn

Reputation: 251398

pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)

The lambda computes row/row.sum(), and apply with axis=1 runs it on each row.

(If doing this in Python 2, you should use from __future__ import division to make sure division always returns a float.)
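The same pattern should work for column percentages by switching to axis=0, so the lambda receives one column at a time. A small sketch on the question's data (the names row_pct and col_pct are mine):

```python
import pandas as pd

# Same two categorical columns as in the question.
df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

counts = pd.crosstab(df["A"], df["B"])

# axis=1: the lambda sees one row at a time, so rows sum to 1.
row_pct = counts.apply(lambda r: r / r.sum(), axis=1)

# axis=0: the lambda sees one column at a time, so columns sum to 1.
col_pct = counts.apply(lambda c: c / c.sum(), axis=0)

print(row_pct)
print(col_pct)
```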

Upvotes: 81
