Reputation: 26027
I want to work out which columns of a pandas dataframe are categorical. I can get each column's type, but I'm struggling to figure out which columns are categories.
import numbers

import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

titanic_df = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')

#ID datatype
def idDataTypes(inputDataFrame):
    columnTypesDict = {}
    for columns in inputDataFrame.columns.values:
        #try to convert to number. If it doesn't work it will convert to another type
        #(note: the apply() also adds 1 to every numeric value in the column)
        try:
            inputDataFrame[columns] = pd.to_numeric(inputDataFrame[columns], errors='ignore').apply(lambda x: x + 1 if isinstance(x, numbers.Number) else x)
        except Exception:
            print(columns, " cannot convert.")
        #create dictionary with the label
        if is_numeric_dtype(inputDataFrame[columns]):
            columnTypesDict[columns] = "numeric"
        elif is_string_dtype(inputDataFrame[columns]):
            columnTypesDict[columns] = "string"
        else:
            print("something else", inputDataFrame[columns].dtype)
    #category
    cols = inputDataFrame.columns
    num_cols = inputDataFrame._get_numeric_data().columns
    proposedCategory = list(set(cols) - set(num_cols))
    for value in proposedCategory:
        columnTypesDict[value] = "category"
    return columnTypesDict

idDataTypes(titanic_df)
The results I'm getting are not what I expect:
{'pclass': 'numeric',
'survived': 'numeric',
'name': 'category',
'sex': 'category',
'age': 'numeric',
'sibsp': 'numeric',
'parch': 'numeric',
'ticket': 'category',
'fare': 'numeric',
'cabin': 'category',
'embarked': 'category',
'boat': 'category',
'body': 'numeric',
'home.dest': 'category'}
pclass should be a category and name should not be.
I'm not sure how to assess if something is a category or not. Any ideas?
Upvotes: 2
Views: 2140
Reputation: 23
One quick (and lazy) workaround I've found is to use the pandas .corr() method to pick out the numerical columns for you. From what I've observed, .corr() keeps only the numerical columns when it returns the pairwise correlations for the whole dataframe (provided you apply it to the entire dataset). On pandas 2.0 and later you have to pass numeric_only=True to get this behaviour, otherwise non-numeric columns raise an error. You can then scan the columns of the original dataframe and treat any column that is not in the dataframe returned by .corr() as categorical. This is not 100% reliable, but it does the job most of the time.
corr_df = df.corr(numeric_only=True)  # returns a dataframe with numeric columns only
num_cols = corr_df.columns
cat_cols = [col for col in df.columns if col not in num_cols]
PS: This may be time- and memory-intensive if the dataset contains a lot of columns.
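To make the idea concrete, here is a minimal sketch on a made-up three-column frame (the `toy` data is hypothetical, not the Titanic file). It assumes pandas 1.5 or later, where `numeric_only` is accepted by `.corr()`. It also shows `select_dtypes`, which for this toy frame gives the same split without computing any correlations.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "age": [29.0, 2.0, 30.0, 25.0],
    "fare": [211.3, 151.6, 151.6, 26.6],
    "sex": ["female", "male", "male", "female"],
})

# Correlations are computed for numeric columns only.
corr_df = toy.corr(numeric_only=True)
num_cols = list(corr_df.columns)

# Anything missing from the correlation matrix is treated as non-numeric.
cat_cols = [col for col in toy.columns if col not in num_cols]

# Cheaper alternative: select numeric columns by dtype directly.
num_cols_direct = list(toy.select_dtypes(include="number").columns)

print(num_cols)         # ['age', 'fare']
print(cat_cols)         # ['sex']
print(num_cols_direct)  # ['age', 'fare']
```

Keep in mind that this only separates numeric from non-numeric columns; a numeric column such as pclass would still land in `num_cols`.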
Upvotes: 0
Reputation: 93181
Here's the bug in your code:
proposedCategory = list(set(cols) - set(num_cols))
Every column that is not numeric becomes a category, regardless of what it contains.
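A small sketch of what that set difference does, on a hypothetical toy frame (the data and the name `toy` are made up). I'm using `select_dtypes` as a public stand-in for the private `_get_numeric_data()`, which is an assumption on my part; for this frame both should pick the same numeric columns.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "pclass": [1, 2, 3, 1],
    "name": ["Allen", "Brown", "Carter", "Davis"],
    "sex": ["f", "m", "m", "f"],
})

cols = toy.columns
# Public stand-in for the private _get_numeric_data() used in the question.
num_cols = toy.select_dtypes(include="number").columns

# Same set difference as in the question: whatever is not numeric is kept.
proposed_category = sorted(set(cols) - set(num_cols))
print(proposed_category)  # ['name', 'sex']
```

So name is labelled a category simply because it holds strings, and pclass can never be one because it holds integers.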
There is no right way to do this either, since whether a column is categorical is best decided manually, with knowledge of the data the column contains. You are trying to do it automatically. One way to do that is to count the number of unique values in the column. If there are relatively few unique values compared to the number of rows, the column is likely categorical.
#category
for name, column in inputDataFrame.items():  # iteritems() was removed in pandas 2.0
    unique_count = column.unique().shape[0]
    total_count = column.shape[0]
    if unique_count / total_count < 0.05:
        columnTypesDict[name] = 'category'
The 5% threshold is arbitrary. No column will be identified as categorical if your dataframe has 20 rows or fewer, because even a single unique value then gives a ratio of at least 0.05. For best results, you will have to adjust that ratio for small and big dataframes.
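Here is the same heuristic as a standalone sketch with the threshold exposed as a parameter. The function name `guess_categorical`, the parameter `max_unique_ratio`, and the toy data are all hypothetical, chosen only to show how the ratio behaves on a larger and a smaller frame.

```python
import pandas as pd


def guess_categorical(df, max_unique_ratio=0.05):
    """Return names of columns whose unique/total ratio is below the threshold."""
    total_count = len(df)
    result = []
    for name, column in df.items():
        unique_count = column.unique().shape[0]
        if unique_count / total_count < max_unique_ratio:
            result.append(name)
    return result


# Hypothetical toy data with 120 rows.
toy = pd.DataFrame({
    "pclass": [1, 2, 3] * 40,                        # 3 unique values
    "name": [f"passenger {i}" for i in range(120)],  # all values unique
    "fare": [float(i % 30) for i in range(120)],     # 30 unique values
})

print(guess_categorical(toy))                        # ['pclass']
print(guess_categorical(toy, max_unique_ratio=0.5))  # ['pclass', 'fare']
print(guess_categorical(toy.head(15)))               # []
```

The last call shows the small-dataframe problem: with only 15 rows, even pclass has a ratio of 3/15 = 0.2, so nothing passes the default threshold.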
Upvotes: 2