Reputation: 6488
I am working on an ML problem to predict house prices, and Zip Code is one feature which will be useful. I am also trying to use a Random Forest Regressor to predict the log of the price.
However, should I use One Hot Encoding or LabelEncoder for Zip Code? I have about 2000 Zip Codes in my dataset, and performing One Hot Encoding will expand the columns significantly.
To rephrase: does it make sense to use LabelEncoder instead of One Hot Encoding on Zip Codes?
Upvotes: 1
Views: 1129
Reputation: 6649
As the link says:
LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,2,1,3,2], but then the imposed ordinality means that the average of dog and mouse is cat. Still, there are algorithms like decision trees and random forests that can work with categorical variables just fine, and LabelEncoder can be used to store values using less disk space.
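A minimal sketch of the quoted behaviour with scikit-learn (note that LabelEncoder assigns integers alphabetically, not in order of appearance):

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    encoded = le.fit_transform(["dog", "cat", "dog", "mouse", "cat"])
    print(encoded)       # [1 0 1 2 0] -- cat=0, dog=1, mouse=2
    print(le.classes_)   # ['cat' 'dog' 'mouse']

A tree-based model can still split on such integer codes, but the arbitrary ordering (cat < dog < mouse) is exactly what the quote warns about for other model families.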
And yes, you are right: with 2000 zip code categories, one-hot encoding can blow up your feature set massively. In cases where I had this problem, I opted for binary encoding, and it worked out fine most of the time, so it is worth a shot for you.
Imagine you have 9 categories, numbered 1 to 9. Binary-encoding them gives you:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
There you go: you overcome the LabelEncoder ordinality problem, and you get 4 feature columns instead of the 9 that one-hot encoding would need. This is the basic intuition behind the Binary Encoder.
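The third-party category_encoders package ships a BinaryEncoder that does this, but the transformation is simple enough to sketch by hand. Below is a minimal, hypothetical helper (the function name binary_encode and the bit_i column names are my own) that reproduces the table above:

    import numpy as np
    import pandas as pd

    def binary_encode(series):
        # Assign each category a 1-based integer code, as in the table above
        codes = series.astype("category").cat.codes + 1
        # Number of bits needed to represent the largest code
        n_bits = int(np.ceil(np.log2(codes.max() + 1)))
        # Expand each code into its binary digits, most significant bit first
        bits = (codes.to_numpy()[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1
        return pd.DataFrame(bits, index=series.index,
                            columns=[f"bit_{i}" for i in range(n_bits)])

    zips = pd.Series(["10001", "94105", "60601", "10001"])
    print(binary_encode(zips))

With roughly 2000 distinct zip codes, this gives you ceil(log2(2001)) = 11 columns instead of 2000, which a random forest can handle comfortably.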
Upvotes: 2