Reputation: 127
I want to build an algorithm that can detect which columns are categorical in a dataframe and which are numerical.
Let's have a look at this dataset (just as an example) :
import pandas as pd
df = pd.DataFrame({"ID": [12324, 26342, 62438], "passengerClass": [1, 2, 2], "nationality": ["FR", "ES", "US"]})
I can assume that categorical data are object/category types :
df.dtypes
As we can see, the "nationality" column is detected as an object type, which is great. The problem is that the "ID" and "passengerClass" columns are detected as int64 but are actually categorical.
Is there a way to detect that these columns are also categorical? (I also thought about counting unique values, but that breaks down for continuous measurements: if we record the speed of lots of cars, almost every value will be different too. Checking for increasing values has the same problem, because rows can be deleted and the IDs will no longer be in order.)
Zero proposed this: https://stackoverflow.com/a/29803290/13919003 But his answer does not account for int or float columns that are categorical, which is the case for the "passengerClass" column.
Upvotes: 2
Views: 1626
Reputation: 3663
You can try this:
import pandas as pd
df = pd.DataFrame({"ID": [12324, 26342, 62438], "passengerClass": [1, 2, 2], "nationality": ["FR", "ES", "US"]})
df = df.astype('category')
print(df.dtypes)
Output:
ID category
passengerClass category
nationality category
dtype: object
Note:
In the above example, all the columns are converted to "category", but you can explicitly specify dtype for individual columns.
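As a minimal sketch of the per-column variant mentioned in the note, astype also accepts a dict mapping column names to dtypes. Which columns to convert is an assumption made here for illustration; "ID" is deliberately left out only to show that unlisted columns keep their original dtype.

```python
import pandas as pd

df = pd.DataFrame({"ID": [12324, 26342, 62438], "passengerClass": [1, 2, 2], "nationality": ["FR", "ES", "US"]})

# Convert only the listed columns; columns not named in the dict are left unchanged.
# The choice of columns below is illustrative, not a rule.
df = df.astype({"passengerClass": "category", "nationality": "category"})

print(df.dtypes)
```

Columns that should stay numeric are simply omitted from the dict.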
----- Alternative Option -----
You can create a config file that explicitly specifies the dtype for each column name:
Config File:
[
    {
        "columnName": "ID",
        "columnDtype": "category"
    },
    {
        "columnName": "passengerClass",
        "columnDtype": "category"
    },
    {
        "columnName": "nationality",
        "columnDtype": "category"
    }
]
Code:
import json
import pandas as pd

df = pd.DataFrame({"ID": [12324, 26342, 62438], "passengerClass": [1, 2, 2], "nationality": ["FR", "ES", "US"]})

# Load the list of {"columnName": ..., "columnDtype": ...} entries
with open('./config.json') as cf:
    configList = json.load(cf)

# Cast each configured column to its declared dtype
for col in configList:
    colName = col['columnName']
    colType = col['columnDtype']
    df[colName] = df[colName].astype(colType)

print(df.dtypes)
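Once the dtypes have been set explicitly, the categorical and numerical columns can be listed with select_dtypes. The following is only a sketch: an inline dict (the hypothetical name column_dtypes) stands in for the config file, and the "fare" column with its values is made up purely so that the numerical list is not empty.

```python
import pandas as pd

# "fare" and its values are hypothetical, added only for illustration.
df = pd.DataFrame({
    "ID": [12324, 26342, 62438],
    "passengerClass": [1, 2, 2],
    "nationality": ["FR", "ES", "US"],
    "fare": [7.25, 71.28, 8.05],
})

# Stand-in for the config file: column name -> dtype (assumed mapping).
column_dtypes = {"ID": "category", "passengerClass": "category", "nationality": "category"}
df = df.astype(column_dtypes)

# Split the columns by their (now explicit) dtype.
categorical_cols = df.select_dtypes(include="category").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

This does not detect anything automatically; it only reads back the dtypes that were declared.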
Upvotes: 2