KevOMalley743

Reputation: 581

if-else by column dtype in pandas

Formatting output from pandas

I'm trying to automate getting output from pandas in a format that I can use with the minimum of messing about in a word processor. I'm using descriptive statistics as a practice case and so I'm trying to use the output from df[variable].describe(). My problem is that .describe() responds differently depending on the dtype of the column (if I'm understanding it properly).

In the case of a numerical column describe() produces this output:

count    306.000000
mean      36.823529
std        6.308587
min       10.000000
25%       33.000000
50%       37.000000
75%       41.000000
max       50.000000
Name: gses_tot, dtype: float64

However, for categorical columns, it produces:

count        306
unique         3
top       Female
freq         166
Name: gender, dtype: object

Because of this difference, I need separate code for each case to capture the information I want. However, I can't get my code to work on the categorical variables.

What I've tried

I've tried a few different versions of:

for v in df.columns:
    if df[v].dtype.name == 'category': #i've also tried 'object' here
        c, u, t, f, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'Largest category = {t}')
        print(f'Percentage = {(f/c)*100}%')        
    else:
        c, m, std, mi, tf, f, sf, ma, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'M = {m}')
        print(f'SD = {std}')
        print(f'Range = {float(ma) - float(mi)}')
        print(f'\n')

The code in the else block works fine, but when the loop reaches a categorical column I get the error below:

******age****** # this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-f077cc105185> in <module>
      6         print(f'Percentage = {(f/c)*100}')
      7     else:
----> 8         c, m, std, mi, tf, f, sf, ma, = df[v].describe()
      9         print(f'******{str(v)}******')
     10         print(f'M = {m}')

ValueError: not enough values to unpack (expected 8, got 4)

What I want to happen is something like:

******age****** # this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0


******gender******
Largest category = female
Percentage = 52.2%


I believe the issue is how I'm setting up the if statement with the dtype. I've rooted around to try to find out how to access the dtype properly, but I can't seem to make it work.

Advice would be much appreciated.

Upvotes: 0

Views: 256

Answers (1)

Stef

Reputation: 30609

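Instead of testing the dtype, you can check which fields are included in the output of describe() and print the corresponding section. Both category and object columns produce a 'top' entry, so a single check covers them: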

import pandas as pd

df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})

for v in df.columns:
    desc = df[v].describe()
    print(f'******{str(v)}******')
    if 'top' in desc:
        print(f'Largest category = {desc["top"]}')
        print(f'Percentage = {(desc["freq"]/desc["count"])*100:.1f}%')        
    else:
        print(f'M = {desc["mean"]}')
        print(f'SD = {desc["std"]}')
        print(f'Range = {float(desc["max"]) - float(desc["min"])}')
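
If you would rather branch on the dtype itself, as in the question, one possible sketch uses the helpers in pandas.api.types rather than comparing dtype.name against a string. The function name summarise and the sample data are made up for illustration, and datetime columns are not handled here.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summarise(series):
    """Return a dict of summary values chosen by the column's dtype (hypothetical helper)."""
    desc = series.describe()
    # is_numeric_dtype is also True for booleans, but describe() reports
    # booleans with top/freq, so they are sent to the categorical branch.
    if is_numeric_dtype(series) and not is_bool_dtype(series):
        return {
            'M': desc['mean'],
            'SD': desc['std'],
            'Range': desc['max'] - desc['min'],
        }
    # Category, object and string columns all end up here.
    return {
        'Largest category': desc['top'],
        'Percentage': desc['freq'] / desc['count'] * 100,
    }


# Made-up sample data, one column per kind of dtype.
df = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'e']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'b'],
})

for v in df.columns:
    print(f'******{v}******')
    for label, value in summarise(df[v]).items():
        print(f'{label} = {value}')
```

Returning a dict instead of printing directly keeps the numbers available for further formatting, for example rounding the percentage before it goes into a document.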

Upvotes: 1
