chintan s

Reputation: 6488

Python strategies for handling categorical variables

I am currently working on a binary classification task where the class is imbalanced.

I have the following attributes, most of them categorical with different levels:

time_slot: 8 levels
product_type: 3 levels
state: 40 levels
due_day: 6 levels (Mon - Sat)
lead_time: numerical in days (0-100)

Now, I am planning to use three algorithms to start with:

Logistic Regression, Decision Tree and Random Forest

I am confused about which encoding strategy is best for the categorical variables:

LabelEncoder, OneHot, BinaryEncoding?

Also, I am thinking of creating bins for lead_time.
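(For the binning idea, `pandas.cut` is one common way to do it. The bin edges and labels below are made up for illustration and would need tuning on the real data.)

```python
import pandas as pd

# Hypothetical lead_time values in days (0-100), as described above
df = pd.DataFrame({"lead_time": [0, 3, 12, 45, 99]})

# One possible binning scheme -- edges are arbitrary here and should be
# chosen from the data (e.g. via quantiles with pd.qcut)
df["lead_time_bin"] = pd.cut(
    df["lead_time"],
    bins=[-1, 7, 30, 60, 100],
    labels=["week", "month", "quarter", "long"],
)
print(df["lead_time_bin"].tolist())
# -> ['week', 'week', 'month', 'quarter', 'long']
```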

Any pointers/tips will be useful.

Upvotes: 0

Views: 75

Answers (1)

afsharov

Reputation: 5164

I believe there is no concise answer to your question, especially since the specifics of your data set are unknown. It is usually a good idea to try different approaches and see which works best in your particular case. If you are working with scikit-learn, you might also want to have a look at the category_encoders library, which provides additional encoders with a compatible API.

In general, all strategies have their pros and cons. Label encoding of categorical features is generally discouraged, as it artificially introduces an order among the categories (especially problematic for algorithms that compute weights, like Logistic Regression). One-hot encoding, on the other hand, increases the dimensionality because every category becomes its own binary feature. In your case that may not be so dramatic, but it is definitely a bad idea if you have several categorical features with high cardinality.
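To make the trade-off concrete, here is a small sketch with scikit-learn's `OneHotEncoder` and `OrdinalEncoder` (the column values are made up; your data will differ):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame with two of the question's categorical columns
X = pd.DataFrame({
    "time_slot": ["morning", "evening", "morning"],
    "product_type": ["A", "B", "A"],
})

# One-hot: one binary column per category level -- safe for Logistic
# Regression, but the width grows with cardinality (40 columns for state)
ohe = OneHotEncoder(handle_unknown="ignore")
X_ohe = ohe.fit_transform(X)
print(X_ohe.shape)  # (3, 4): 2 time_slot levels + 2 product_type levels

# Ordinal/label encoding: stays compact, but imposes an artificial order
# that weight-based models will treat as meaningful
ord_enc = OrdinalEncoder()
X_ord = ord_enc.fit_transform(X)
print(X_ord.shape)  # (3, 2)
```

Tree-based models like Decision Trees and Random Forests are less sensitive to the artificial order, which is why the encoding choice often matters more for Logistic Regression.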

Lastly, you could also just check out some algorithms that can handle categorical features out of the box, like CatBoost or LightGBM.
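As a sketch of that last option: LightGBM's scikit-learn wrapper can consume pandas `category` columns directly, so the only preparation needed is a dtype conversion (the actual fit call is shown as a comment since it requires lightgbm to be installed; the column values here are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "NY", "TX"],      # made-up values for illustration
    "due_day": ["Mon", "Tue", "Sat"],
})

# Converting to the category dtype is all the preparation LightGBM needs;
# its scikit-learn wrapper detects such columns automatically
for col in ["state", "due_day"]:
    df[col] = df[col].astype("category")

print(df.dtypes.tolist())

# With lightgbm installed, the frame can then be used as-is, e.g.:
#   import lightgbm as lgb
#   model = lgb.LGBMClassifier().fit(df, y)
```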

"Just give them a try" might sound unsatisfying, but I think it is a solid approach.

Upvotes: 1
