buhtz
buhtz

Reputation: 12172

Combine count and percentage (normalization) in pandas crosstab

I know that I can have percentage values in a pandas.crosstab() when normalize=True. But I want to combine absolute and normalized values in one table.

What I expect is a snipped like this

YEAR       2000     2001       2002
foo      n    %    n   %     n    %
A        1 0.16    0 0.0     1 0.16

This is an MWE producing normalized and counted values in separte tables.

#!/usr/bin/env python3
import pandas as pd

data = {'YEAR': [2001, 2001, 2002, 2000, 2000, 2002],
        'foo': ['C', 'B', 'A', 'A', 'B', 'C']}
        
df = pd.DataFrame(data)
df.YEAR = df.YEAR.astype('int16')
df.foo = df.foo.astype('category')

print(df)

taba = pd.crosstab(df.foo, df.YEAR, margins=True, dropna=False)
tabb = pd.crosstab(df.foo, df.YEAR, margins=True, dropna=False, normalize=True)

print(taba)
print(tabb)

Output of the MWE

   YEAR foo
0  2001   C
1  2001   B
2  2002   A
3  2000   A
4  2000   B
5  2002   C

YEAR  2000  2001  2002  All
foo
A        1     0     1    2
B        1     1     0    2
C        0     1     1    2
All      2     2     2    6

YEAR      2000      2001      2002       All
foo
A     0.166667  0.000000  0.166667  0.333333
B     0.166667  0.166667  0.000000  0.333333
C     0.000000  0.166667  0.166667  0.333333
All   0.333333  0.333333  0.333333  1.000000

Upvotes: 2

Views: 2451

Answers (2)

SeaBean
SeaBean

Reputation: 23217

You can join the 2 result dataframes and then rearrange the column index, as follows:

tab2 = taba.join(tabb, lsuffix='_n', rsuffix='_%')

tab2.columns = tab2.columns.map(lambda x: tuple(x.split('_')))

tab2 = (tab2.sort_index(ascending=[True, False] , axis=1)
            .rename_axis(columns=['YEAR', 'count_pct'], axis=1)
       )

Result:

YEAR      2000           2001           2002           All          
count_pct    n         %    n         %    n         %   n         %
foo                                                                 
A            1  0.166667    0  0.000000    1  0.166667   2  0.333333
B            1  0.166667    1  0.166667    0  0.000000   2  0.333333
C            0  0.000000    1  0.166667    1  0.166667   2  0.333333
All          2  0.333333    2  0.333333    2  0.333333   6  1.000000

Edit:

Breaking down the steps with more explanations on .sort_index() and rename_axis():

The interim resulting layout of tab2 before the last step is as follows:

YEAR 2000 2001 2002 All      2000      2001      2002       All
YEAR    n    n    n   n         %         %         %         %
foo                                                            
A       1    0    1   2  0.166667  0.000000  0.166667  0.333333
B       1    1    0   2  0.166667  0.166667  0.000000  0.333333
C       0    1    1   2  0.000000  0.166667  0.166667  0.333333
All     2    2    2   6  0.333333  0.333333  0.333333  1.000000

Here, 2 more fine-tunings we need to do:

  1. Group the columns by same years together, so that each n and % is under the same year. We do it by sort_index() here. axis=1 is to specify the index sorting is on columns instead of row index. The ascending= parameter is to specify the sorting order of the 2 levels of the column MultiIndex. The first True is to specify that the YEAR index should be sorted in ascending order, while the second False is to specify sorting 'n' and '%' in descending order. This is the required sorting for them to appear in the required sequence with 'n' to show before '%'.

Result:

YEAR 2000           2001           2002           All          
YEAR    n         %    n         %    n         %   n         %
foo                                                            
A       1  0.166667    0  0.000000    1  0.166667   2  0.333333
B       1  0.166667    1  0.166667    0  0.000000   2  0.333333
C       0  0.000000    1  0.166667    1  0.166667   2  0.333333
All     2  0.333333    2  0.333333    2  0.333333   6  1.000000
  1. Second fine-tuning is to change the axis-name of the lower MultiIndex from 'YEAR' to 'count_pct'. As you can see, there are now 2 'YEAR' on the left of the first and second line of the display. These corresponds to the axis names of first and second level column MultiIndex. We don't want all these axis names are the same. Hence, change by .rename_axis to:
YEAR      2000           2001           2002           All          
count_pct    n         %    n         %    n         %   n         %
foo                                                                 
A            1  0.166667    0  0.000000    1  0.166667   2  0.333333
B            1  0.166667    1  0.166667    0  0.000000   2  0.333333
C            0  0.000000    1  0.166667    1  0.166667   2  0.333333
All          2  0.333333    2  0.333333    2  0.333333   6  1.000000

The axis name of second (lower) level of column MultiIndex is changed to 'count_pct'.

Upvotes: 2

Code Different
Code Different

Reputation: 93161

Since you already calculated the two cross tabs, the simplest solution is to concatenate them into your final data frame:

taba = pd.crosstab(df.foo, df.YEAR, dropna=False)
tabb = pd.crosstab(df.foo, df.YEAR, dropna=False, normalize=True)

tab = (
    pd.concat([taba, tabb], axis=1, keys=['n', '%'])
      .swaplevel(axis=1)
      .sort_index(axis=1, ascending=[True, False])
      .rename_axis(['YEAR', 'foo'], axis=1)
)

Resulting output is:

YEAR 2000           2001           2002
foo     n         %    n         %    n         %
foo
A       1  0.166667    0  0.000000    1  0.166667
B       1  0.166667    1  0.166667    0  0.000000
C       0  0.000000    1  0.166667    1  0.166667

Upvotes: 1

Related Questions