The Great

Reputation: 7733

Find column value wise total against another column using pandas

I have a dataframe like the one shown below:

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)

cf = pd.DataFrame({'grade': rng.choice(list('ACD'), size=8),
                   'dash': rng.choice(list('PQRS'), size=8),
                   'dumeel': rng.choice(list('QWER'), size=8),
                   'dumma': rng.choice(1234, size=8),
                   'target': rng.choice([0, 1], size=8)
})

I would like to do the following:

a) Find the total and % total for each value in the categorical columns against the target column

I tried the code below, but it only gets me halfway to the result.

cols = cf.select_dtypes('object')
cf.melt('target',cols).groupby(['variable','value']).size().reset_index(name='cnt of records')

How can I use the above result to compute target met and target not met details using the target column?

I expect my output to look like the example below (note that I have shown only two columns, grade and dash, as a sample). The code should follow the same logic for all string columns.

[image: expected output table with count of records, target met, target not met, and their percentages for each value of grade and dash]

Upvotes: 1

Views: 58

Answers (2)

Corralien

Reputation: 120469

Select the columns to flatten with melt, then join the target column back in. Finally, group by the variable and value columns and apply a dict of functions to each group.

funcs = {
  'cnt of records': 'count',
  'target met': lambda x: sum(x),
  'target not met': lambda x: len(x) - sum(x),
  'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%",
  'target not met %': lambda x: f"{round(100 * (len(x) - sum(x)) / len(x), 2):.2f}%"
}

out = cf.select_dtypes('object').melt(ignore_index=False).join(cf['target']) \
        .groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()

Output:

>>> out
  variable value  cnt of records  target met  target not met target met % target not met %
0     dash     Q               2           0               2        0.00%          100.00%
1     dash     R               2           2               0      100.00%            0.00%
2     dash     S               4           2               2       50.00%           50.00%
3   dumeel     E               3           2               1       66.67%           33.33%
4   dumeel     Q               3           2               1       66.67%           33.33%
5   dumeel     R               1           0               1        0.00%          100.00%
6   dumeel     W               1           0               1        0.00%          100.00%
7    grade     A               2           0               2        0.00%          100.00%
8    grade     C               3           2               1       66.67%           33.33%
9    grade     D               3           2               1       66.67%           33.33%
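
For comparison, a minimal alternative sketch (assuming the same imports and cf from the question) that produces the per-group counts and row-wise percentages with pd.crosstab instead of a dict of lambdas:

# melt the string columns and bring the target back in, as above
melted = cf.select_dtypes('object').melt(ignore_index=False).join(cf['target'])

# counts of target = 0 / target = 1 per (variable, value) pair
counts = pd.crosstab([melted['variable'], melted['value']], melted['target'])

# the same counts normalized within each row, expressed as percentages
pcts = pd.crosstab([melted['variable'], melted['value']],
                   melted['target'], normalize='index') * 100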

Upvotes: 2

user17242583

Reputation:

You can use agg after the groupby for this:

cols = cf.select_dtypes('object')
df = (
    cf.melt('target', cols)
      .groupby(['variable', 'value'])['target']
      .agg([('l', 'size'), ('s', 'sum')])  # l = rows in the group, s = rows where target == 1
      .pipe(lambda x: x.assign(
          met_pct=x.s / x.l * 100,
          not_met_pct=100 - (x.s / x.l * 100),
          met=x.s,
          not_met=x.l - x.s
      ))
      .reset_index()
      .drop(['l', 's'], axis=1)
)

Output:

>>> df
   variable value     met_pct  not_met_pct  met  not_met
0      dash     P  100.000000     0.000000    1        0
1      dash     Q    0.000000   100.000000    0        3
2      dash     R   50.000000    50.000000    1        1
3      dash     S   50.000000    50.000000    1        1
4    dumeel     E    0.000000   100.000000    0        1
5    dumeel     Q  100.000000     0.000000    1        0
6    dumeel     R   50.000000    50.000000    2        2
7    dumeel     W    0.000000   100.000000    0        2
8     grade     A    0.000000   100.000000    0        1
9     grade     C   50.000000    50.000000    2        2
10    grade     D   33.333333    66.666667    1        2
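
If the percentage columns should match the string format shown in the question (e.g. 66.67%), a small follow-up sketch, assuming the df built above:

# format the percentage columns as strings like "66.67%"
# (assumption: this string layout is what the question's expected output shows)
df['met_pct'] = df['met_pct'].map('{:.2f}%'.format)
df['not_met_pct'] = df['not_met_pct'].map('{:.2f}%'.format)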

Upvotes: 2
