How could I detect subtypes in pandas object columns?

I have the following DataFrame:

import datetime
import pandas as pd

df = pd.DataFrame({'a': [100, 3, 4], 'b': [20.1, 2.3, 45.3], 'c': [datetime.time(23, 52), 30, 1.00]})

and I would like to detect the subtypes in each column without explicitly writing a loop, if possible.

I am looking for the following output:

column a = [int]
column b = [float]
column c = [datetime.time, int, float]

Upvotes: 9

Views: 6486

Answers (5)

Berel Levy

Reputation: 121

df.applymap(type).apply(set)

If you only want to check columns with dtype of object, use:

df.select_dtypes(object).applymap(type).apply(set)

Your output will look something like:

column_a                                 {<class 'str'>}
column_b                {<class 'str'>, <class 'float'>}
column_c    {<class 'decimal.Decimal'>, <class 'float'>}

Explanation:

applymap will replace the value of each cell in the df with its Python type,

apply will then place all the values of each column into a Python set, which keeps only the unique types (duplicates are discarded).
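Applied to the DataFrame from the question, the same idea can be sketched as follows (a sketch using Series.map per column, since DataFrame.applymap is deprecated in recent pandas in favor of DataFrame.map; note that numeric columns report NumPy scalar types such as numpy.int64 rather than plain int):

```python
import datetime

import pandas as pd

df = pd.DataFrame({'a': [100, 3, 4], 'b': [20.1, 2.3, 45.3],
                   'c': [datetime.time(23, 52), 30, 1.00]})

# Map each cell to its Python type, then collapse each column into a set
subtypes = df.apply(lambda col: set(col.map(type)))

# Column 'c' holds a mix: datetime.time, int and float
print(subtypes['c'])
```

Set ordering is arbitrary, but the set for column c contains exactly datetime.time, int, and float.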

Upvotes: 1

JPvRiel

Reputation: 231

@jpp's answer was helpful.

I expanded on it to show more explicitly how the dtype relates to the Python type (py_type), as well as to display the shorthand NumPy kind, tabulated as a metadata DataFrame:

import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [100, 3, 4], 'b': [20.1, 2.3, 45.3], 'c': [datetime.time(23, 52), 30, 1.00], 'd': ['s1', 's2', 's3']})

types_df = {
    c: {
        'dtype': df[c].dtype,
        'np_kind': df[c].dtype.kind if isinstance(df[c].dtype, np.dtype) else None,
        'py_types': set(map(type, df[c])) if df[c].dtype == np.dtype('O') else {df[c].dtype.type}
    }
    for c in df.columns
}

pd.DataFrame.from_dict(types_df, orient='index')

Upvotes: 1

Matthew R. DeVerna

Reputation: 31

Just wanted to provide what I found to be a more readable version...

Load your packages and create the dataframe

# Packages
import pandas as pd
import datetime

# DataFrame  
df = pd.DataFrame({'a': [100, 3,4], 'b': [20.1, 2.3,45.3], 'c': [datetime.time(23,52), 30,1.00]})

# Map over each column individually, within a print
print("column a =", df.a.map(type).unique())
print("column b =", df.b.map(type).unique())
print("column c =", df.c.map(type).unique())

# Outputs:
column a = [<class 'int'>]
column b = [<class 'float'>]
column c = [<class 'datetime.time'> <class 'int'> <class 'float'>]

Likely unnecessary (and a bit more complicated), but the following will help you remove the class and < > characters...

# Use `.__name__` within a list comprehension to access only the type name
print("column a =", [x.__name__ for x in df.a.map(type).unique()])
print("column b =", [x.__name__ for x in df.b.map(type).unique()])
print("column c =", [x.__name__ for x in df.c.map(type).unique()])

# Outputs:
column a = ['int']
column b = ['float']
column c = ['time', 'int', 'float']

While this is repetitive, and I know repetition in code is often frowned upon, it is much simpler to understand if you were sharing this code with someone else (at least to me) and, thus, more valuable (again, in my opinion).

Upvotes: 3

Abhi

Reputation: 4233

You can just use the Python built-in function map:

column_c = list(map(type, df['c']))
print(column_c)

output:
[datetime.time, int, float]

types = {i: set(map(type, df[i])) for i in df.columns}
# this will return the unique Python types of every column in a dict

Upvotes: 2

jpp

Reputation: 164623

You should appreciate that with Pandas you can have two broad types of series:

  1. Optimised structures: Usually numeric data, this includes np.datetime64 and bool.
  2. object dtype: Used for series with mixed types or types which cannot be held natively in a NumPy array. The series is structured as a sequence of pointers to arbitrary Python objects and is generally inefficient.

The reason for this preamble is you should only ever need to apply element-wise logic to the second type. Data in the first category is homogeneous by nature.

So you should separate your logic accordingly.

Regular dtypes

Use pd.DataFrame.dtypes:

print(df.dtypes)

a      int64
b    float64
c     object
dtype: object

object dtype

Isolate these series via pd.DataFrame.select_dtypes and then use a dictionary comprehension:

obj_types = {col: set(map(type, df[col])) for col in df.select_dtypes(include=[object])}

print(obj_types)

{'c': {int, datetime.time, float}}

You will need to do a little more work to get the exact format you require, but the above should be your plan of attack.
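As a sketch of that extra work, the two branches can be combined into a hypothetical helper (not part of the answer above) that approximates the format in the question; note the non-object columns report their NumPy dtype name, e.g. int64 rather than int:

```python
import datetime

import pandas as pd

df = pd.DataFrame({'a': [100, 3, 4], 'b': [20.1, 2.3, 45.3],
                   'c': [datetime.time(23, 52), 30, 1.00]})

def column_types(s):
    """Element type names for object columns, the dtype name otherwise."""
    if s.dtype == object:
        # Inspect each element only for object-dtype series
        return sorted(t.__name__ for t in set(map(type, s)))
    # Homogeneous numeric data: the dtype alone suffices
    return [s.dtype.name]

for name in df.columns:
    print(f"column {name} = {column_types(df[name])}")
# column a = ['int64']
# column b = ['float64']
# column c = ['float', 'int', 'time']
```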

Upvotes: 12
