Reputation: 117
Does anyone know whether the data type of a variable plays a (negative) role when running a machine learning algorithm in scikit-learn?
Here's a little background that may influence responses to this question: I have a 299-variable dataset where the output variable is a dummy variable. This will be a classification problem, and I would like to try different options such as logistic regression and tree-based models. When I imported my dataset with pandas, I noticed that some of the variables were assigned a data type of int64 when they are in fact categorical variables. Is this going to be a problem for the machine learning algorithm? Please forgive me if this is a silly question. I am still relatively new to the machine learning world, and while I have not seen anything in the literature on this topic, I want to make sure I don't go off track before I even start.
Upvotes: 3
Views: 1580
Reputation: 6524
It will be a problem for scikit-learn, as scikit-learn does not support categorical features. It will end up treating those integer values as a numeric feature, which implies an ordering and spacing between the category codes that does not really exist, so the model will not behave as you might hope. It does support re-encoding them in a numeric form (see here), but that is sub-optimal compared to using a library and algorithms that naturally support both numeric and categorical features.
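To make the re-encoding option concrete, here is a minimal sketch that one-hot encodes an integer-coded categorical column before fitting a logistic regression. The toy data and the column names (`region`, `income`, `target`) are made up for illustration and are not from the question; the approach assumes you already know which int64 columns are really categorical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "region" is a category that pandas stores as int64.
df = pd.DataFrame({
    "region": [1, 2, 3, 1, 2, 3, 1, 2],
    "income": [30.0, 45.0, 52.0, 28.0, 61.0, 39.0, 33.0, 58.0],
    "target": [0, 1, 1, 0, 1, 0, 0, 1],
})
X = df[["region", "income"]]
y = df["target"]

# Assumption: you list the integer-coded categorical columns by hand.
categorical_cols = ["region"]

# One-hot encode the categorical columns; pass the numeric ones through as-is.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Inspect what the classifier actually sees: 3 indicator columns + income.
X_encoded = model.named_steps["preprocess"].transform(X)
print(X_encoded.shape)
```

With the encoder in the pipeline, the classifier gets one indicator column per category instead of a single number whose magnitude carries no meaning. For a 299-variable dataset with many high-cardinality categoricals this can widen the feature matrix considerably, which is part of why re-encoding may be less convenient than a library with native categorical support.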
Upvotes: 2