Reputation: 7733
I have a dataframe like as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cf = pd.DataFrame({'grade': rng.choice(list('ACD'),size=(8)),
'dash': rng.choice(list('PQRS'),size=(8)),
'dumeel': rng.choice(list('QWER'),size=(8)),
'dumma': rng.choice((1234),size=(8)),
'target': rng.choice([0,1],size=(8))
})
I would like to do the below
a) Find the total
and %total
for each of my value in the categorical columns against the target
column
I tried the below but it only gets me to half way of the results.
cols = cf.select_dtypes('object')
cf.melt('target',cols).groupby(['variable','value']).size().reset_index(name='cnt of records')
How can I use the above result to compute target met
and target not met
details using the target
column?
I expect my output to be like as shown below (note that I have shown only two columns grade
and dash
for sample). Code should follow the same logic for all string columns
Upvotes: 1
Views: 58
Reputation: 120469
Select your columns to flatten with melt
then join the target
column. Finally, group by variable
and value
columns and apply a dict of functions to each group.
funcs = {
'cnt of records': 'count',
'target met': lambda x: sum(x),
'target not met': lambda x: len(x) - sum(x),
'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%",
'target not met %': lambda x: f"{round(100 * (len(x) - sum(x)) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
.groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
Output:
>>> out
variable value cnt of records target met target not met target met % target not met %
0 dash Q 2 0 2 0.00% 100.00%
1 dash R 2 2 0 100.00% 0.00%
2 dash S 4 2 2 50.00% 50.00%
3 dumeel E 3 2 1 66.67% 33.33%
4 dumeel Q 3 2 1 66.67% 33.33%
5 dumeel R 1 0 1 0.00% 100.00%
6 dumeel W 1 0 1 0.00% 100.00%
7 grade A 2 0 2 0.00% 100.00%
8 grade C 3 2 1 66.67% 33.33%
9 grade D 3 2 1 66.67% 33.33%
Upvotes: 2
Reputation:
You can use agg
after you groupby
for this:
cols = cf.select_dtypes('object')
df = (
cf.melt('target', cols)
.groupby(['variable','value'])
['target']
.agg([('l', 'size'), ('s', 'sum')]) # l = length (total count of rows in this group), s = sum (total count of rows in the group where target = 1)
.pipe(lambda x: (
x.assign(
met_pct=x.s / x.l * 100,
not_met_pct=100 - (x.s / x.l * 100),
met=x.s,
not_met=x.l - x.s
)
)).reset_index()
.drop(['l', 's'], axis=1)
)
Output:
>>> df
variable value met_pct not_met_pct met not_met
0 dash P 100.000000 0.000000 1 0
1 dash Q 0.000000 100.000000 0 3
2 dash R 50.000000 50.000000 1 1
3 dash S 50.000000 50.000000 1 1
4 dumeel E 0.000000 100.000000 0 1
5 dumeel Q 100.000000 0.000000 1 0
6 dumeel R 50.000000 50.000000 2 2
7 dumeel W 0.000000 100.000000 0 2
8 grade A 0.000000 100.000000 0 1
9 grade C 50.000000 50.000000 2 2
10 grade D 33.333333 66.666667 1 2
Upvotes: 2