Detect which columns are categorical in a dataframe with Python

Question

I want to build an algorithm that can detect which columns are categorical in a dataframe and which are numerical.

Consider this dataset as an example:

import pandas as pd

df = pd.DataFrame({"ID": [12324, 26342, 62438], "passengerClass": [1, 2, 2], "nationality": ["FR", "ES", "US"]})

I can assume that columns with an object or category dtype are categorical:

df.dtypes

The "nationality" column is detected as object, which is what I want. The problem is that the "ID" and "passengerClass" columns are detected as int64 even though they are categorical.

Is there a way to detect that these columns are also categorical? I thought about counting unique values, but that does not work in general: if we measure the speed of lots of cars, the values will almost never repeat, so a numerical column can look just as unique as an ID column. Checking for increasing values has the same problem, because rows can be deleted and the IDs will no longer be in order.
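To make the unique-value idea concrete, here is a minimal sketch of such a heuristic and of where it breaks down. The function name guess_categorical, the max_unique_ratio threshold, and the six-row demo data (including the "speed" column) are all hypothetical and chosen only for illustration; this is not an established pandas feature.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype


def guess_categorical(df, max_unique_ratio=0.5):
    """Guess categorical columns (heuristic, assumption-laden sketch).

    A column is guessed to be categorical if it is non-numeric, or if it is
    an integer column whose share of distinct values is at most
    max_unique_ratio (a hypothetical threshold).
    """
    guessed = []
    n_rows = max(len(df), 1)  # avoid division by zero on an empty frame
    for name in df.columns:
        col = df[name]
        if not is_numeric_dtype(col):
            guessed.append(name)
        elif is_integer_dtype(col) and col.nunique() / n_rows <= max_unique_ratio:
            guessed.append(name)
    return guessed


# Hypothetical demo data, larger than the three-row example above
demo = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105, 106],
    "passengerClass": [1, 2, 2, 3, 1, 3],
    "nationality": ["FR", "ES", "US", "FR", "ES", "US"],
    "speed": [88.1, 92.4, 79.9, 101.3, 95.0, 84.7],
})

print(guess_categorical(demo))  # ['passengerClass', 'nationality']
```

The sketch picks up "passengerClass" but misses "ID", because every ID is distinct, which is exactly the limitation described above.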

Zero proposed this approach: https://stackoverflow.com/a/29803290/13919003 However, that answer does not treat int or float columns as categorical, which is the case for the "passengerClass" column.

Soumendra Mishra · Accepted Answer

You can try this:

import pandas as pd

df = pd.DataFrame({"ID": [12324, 26342, 62438], "passengerClass": [1, 2, 2], "nationality": ["FR", "ES", "US"]})
df = df.astype('category')
print(df.dtypes)

Output:

ID                category
passengerClass    category
nationality       category
dtype: object

Note:

In the above example, all columns are converted to "category", but you can also specify the dtype explicitly for individual columns.
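As a minimal sketch of the per-column variant, astype also accepts a dict mapping column names to dtypes. Which columns to convert is an assumption made here for illustration; "ID" is deliberately left untouched to show that unlisted columns keep their dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [12324, 26342, 62438],
    "passengerClass": [1, 2, 2],
    "nationality": ["FR", "ES", "US"],
})

# Convert only the listed columns; the selection is an illustrative assumption
df = df.astype({"passengerClass": "category", "nationality": "category"})

print(df.dtypes)
```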

----- Alternative Option -----

You can create a config file that explicitly maps each column name to its dtype:

Config File:

[
  {
    "columnName": "ID",
    "columnDtype": "category"
  },
  {
    "columnName": "passengerClass",
    "columnDtype": "category"
  },
  {
    "columnName": "nationality",
    "columnDtype": "category"
  }
]

Code:

import json

import pandas as pd

df = pd.DataFrame({"ID": [12324, 26342, 62438], "passengerClass": [1, 2, 2], "nationality": ["FR", "ES", "US"]})

with open('./config.json') as cf:
    configList = json.load(cf)

for col in configList:
    colName = col['columnName']
    colType = col['columnDtype']
    df[colName] = df[colName].astype(colType)

print(df.dtypes)
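Once the dtypes are set, the split between categorical and numerical columns asked about in the question can be read back with select_dtypes. The sketch below is one possible way to do it; the "fare" column is a hypothetical addition so that the numerical list is not empty.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [12324, 26342, 62438],
    "passengerClass": [1, 2, 2],
    "nationality": ["FR", "ES", "US"],
})
df = df.astype("category")

# Hypothetical numeric column, added only so the example has one
df["fare"] = [7.25, 71.28, 8.05]

# Select column names by dtype family
categorical_cols = df.select_dtypes(include="category").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)  # ['ID', 'passengerClass', 'nationality']
print(numerical_cols)    # ['fare']
```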
