How can I automatically detect if a column is categorical?

Question

I want to detect whether a pandas column is categorical. I can get each column's dtype, but I'm struggling to work out which columns are categories.

import pandas as pd

titanic_df = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')

#ID datatype

def idDataTypes(inputDataFrame):
    columnTypesDict = {} 
    import numpy as np
    import numbers
    import pandas as pd
    from pandas.api.types import is_string_dtype
    from pandas.api.types import is_numeric_dtype

    for columns in inputDataFrame.columns.values:
        #print(columns)
        #try to convert to number. If it doesn't work it will convert to another type
        try:
            inputDataFrame[columns] = pd.to_numeric(inputDataFrame[columns], errors='ignore').apply(lambda x: x + 1 if isinstance(x, numbers.Number) else x) 
        except:
            print(columns, " cannot convert.")
        #print(inputDataFrame[columns].dtype)

        #create dictionary with the label
        if is_numeric_dtype(inputDataFrame[columns]): #products[columns].dtype == np.float64:
            columnTypesDict[columns] = "numeric"
        elif is_string_dtype(inputDataFrame[columns]): # products[columns].dtype == np.object:
            columnTypesDict[columns] = "string"
            #print(is_string_dtype(products[columns]))
        else:
            print("something else", inputDataFrame[columns].dtype)

    #category 
    cols = inputDataFrame.columns
    num_cols = inputDataFrame._get_numeric_data().columns
    #num_cols
    proposedCategory = list(set(cols) - set(num_cols))
    for value in proposedCategory:
        columnTypesDict[value] = "category"

    return(columnTypesDict)

idDataTypes(titanic_df)

The results I'm getting are not what I expect:

{'pclass': 'numeric',
 'survived': 'numeric',
 'name': 'category',
 'sex': 'category',
 'age': 'numeric',
 'sibsp': 'numeric',
 'parch': 'numeric',
 'ticket': 'category',
 'fare': 'numeric',
 'cabin': 'category',
 'embarked': 'category',
 'boat': 'category',
 'body': 'numeric',
 'home.dest': 'category'}

pclass should be a category and name should not be.

I'm not sure how to assess if something is a category or not. Any ideas?

Code Different · Accepted Answer

Here's the bug in your code:

proposedCategory = list(set(cols) - set(num_cols))

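Every column that is not numeric gets labeled a category, which is why name, ticket and cabin end up as categories. It also means a numeric column such as pclass can never be labeled one.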
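
To see the effect in isolation, here is a small sketch on a made-up frame (the column names and values are invented for illustration, and select_dtypes is used here as a public way to pick the numeric columns). Every non-numeric column lands in the set difference, no matter how many distinct values it holds:

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],  # free text, every value distinct
    "sex": ["f", "m", "m", "f"],           # genuinely categorical
    "pclass": [1, 3, 3, 2],                # categorical in meaning, stored as integers
})

cols = toy_df.columns
num_cols = toy_df.select_dtypes(include="number").columns
# Same set difference as in the question; sorted only to make the order stable
proposed_category = sorted(set(cols) - set(num_cols))
print(proposed_category)  # ['name', 'sex']
```

name is flagged even though it is unique per row, and pclass is skipped because it is numeric.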

There is no single right way to do this, since whether a column is categorical is best decided manually, with knowledge of the data the column contains. If you want to do it automatically, one heuristic is to count the number of unique values in the column. If there are relatively few unique values, the column is likely categorical.

#category 
# .items() replaces .iteritems(), which was removed in pandas 2.0
for name, column in inputDataFrame.items():
    unique_count = column.unique().shape[0]
    total_count = column.shape[0]
    if unique_count / total_count < 0.05:
        columnTypesDict[name] = 'category'

The 5% threshold is arbitrary. Since every non-empty column has at least one unique value, no column will be identified as categorical if your dataframe has 20 rows or fewer (1/20 is not below 0.05). For best results, you will have to adjust that ratio for small and large dataframes.
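
As a rough, self-contained sketch of how the ratio behaves at different sizes, the heuristic can be wrapped in a function with an adjustable threshold. The names guess_categorical, max_ratio and make_toy are hypothetical, and the data is invented; this is not a tuned rule, just the same unique-to-total ratio check applied to a 100-row and a 10-row frame:

```python
import pandas as pd

def guess_categorical(df, max_ratio=0.05):
    """Hypothetical helper: names of columns whose share of distinct values is below max_ratio."""
    total_count = len(df)
    guessed = []
    for name, column in df.items():
        # dropna=False counts NaN as a value, matching column.unique().shape[0]
        unique_count = column.nunique(dropna=False)
        if total_count and unique_count / total_count < max_ratio:
            guessed.append(name)
    return guessed

def make_toy(n_rows):
    # Invented data: a three-level class, a two-level label, and a unique name per row
    return pd.DataFrame({
        "pclass": [1 + i % 3 for i in range(n_rows)],
        "sex": ["female" if i % 2 else "male" for i in range(n_rows)],
        "name": [f"passenger {i}" for i in range(n_rows)],
    })

large = make_toy(100)
small = make_toy(10)

print(guess_categorical(large))                 # ['pclass', 'sex']
print(guess_categorical(small))                 # [] since even 2/10 is above 0.05
print(guess_categorical(small, max_ratio=0.5))  # ['pclass', 'sex']
```

On the small frame the default ratio finds nothing, while a looser ratio recovers the same two columns. A looser ratio on a large frame would start to admit columns with many distinct values, so the threshold is still a judgment call.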
