Reputation: 1102
Is there a better way to create a contingency table in pandas with pd.crosstab() or pd.pivot_table() to generate counts and percentages.
Current solution
cat=['A','B','B','A','B','B','A','A','B','B']
target = [True,False,False,False,True,True,False,True,True,True]
import pandas as pd
df=pd.DataFrame({'cat' :cat,'target':target})
using crosstab
totals=pd.crosstab(df['cat'],df['target'],margins=True).reset_index()
percentages = pd.crosstab(df['cat'],
df['target']).apply(lambda row: row/row.sum(),axis=1).reset_index()
and a merge
summaryTable=pd.merge(totals,percentages,on="cat")
summaryTable.columns=['cat','#False',
'#True','All','percentTrue','percentFalse']
output
+---+-----+--------+-------+-----+-------------+--------------+
| | cat | #False | #True | All | percentTrue | percentFalse |
+---+-----+--------+-------+-----+-------------+--------------+
| 0 | A | 2 | 2 | 4 | 0.500000 | 0.500000 |
| 1 | B | 2 | 4 | 6 | 0.333333 | 0.666667 |
+---+-----+--------+-------+-----+-------------+--------------+
Upvotes: 8
Views: 5258
Reputation: 673
I don't have a better answer than the merge, but you could replace
percentages = pd.crosstab(df['cat'],
df['target']).apply(lambda row: row/row.sum(),axis=1).reset_index()
with
percentages = pd.crosstab(df['cat'],df['target']), normalize='index').reset_index()
for a slightly cleaner and maybe easier to read code.
Upvotes: 0
Reputation: 210852
you can do the following:
In [131]: s = df.groupby('cat').agg({'target': ['sum', 'count']}).reset_index(level=0)
In [132]: s.columns
Out[132]:
MultiIndex(levels=[['target', 'cat'], ['sum', 'count', '']],
labels=[[1, 0, 0], [2, 0, 1]])
Let's bring order to column names:
In [133]: s.columns = [col[1] if col[1] else col[0] for col in s.columns.tolist()]
In [134]: s
Out[134]:
cat sum count
0 A 2.0 4
1 B 4.0 6
In [135]: s['pctTrue'] = s['sum']/s['count']
In [136]: s['pctFalse'] = 1 - s.pctTrue
In [137]: s
Out[137]:
cat sum count pctTrue pctFalse
0 A 2.0 4 0.500000 0.500000
1 B 4.0 6 0.666667 0.333333
Upvotes: 1