Reputation: 6488
I am working on an ML problem to predict house prices, and Zip Code is one feature which will be useful. I am also trying to use a Random Forest Regressor to predict the log of the price.
However, should I use One Hot Encoding or LabelEncoder for Zip Code? I have about 2000 Zip Codes in my dataset, and performing One Hot Encoding will expand the columns significantly.
To rephrase: does it make sense to use LabelEncoder instead of One Hot Encoding on Zip Codes?
Upvotes: 1
Views: 1129
Reputation: 6649
As the link says:
LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,2,1,3,2], but then the imposed ordinality means that the average of dog and mouse is cat. Still, there are algorithms like decision trees and random forests that can work with categorical variables just fine, and LabelEncoder can be used to store values using less disk space.
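A minimal sketch of the quoted behaviour with scikit-learn (note that LabelEncoder assigns integers alphabetically, not in order of appearance):

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    encoded = le.fit_transform(["dog", "cat", "dog", "mouse", "cat"])
    print(encoded)       # [1 0 1 2 0] -- cat=0, dog=1, mouse=2
    print(le.classes_)   # ['cat' 'dog' 'mouse']

A tree-based model can still split on such integer codes, but the arbitrary ordering (cat < dog < mouse) is exactly what the quote warns about for other model families.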
And yes, you are right: with 2000 zip code categories, one-hot encoding can blow up your feature set massively. In cases where I had this problem, I opted for binary encoding, and it worked out fine most of the time, so it is worth a shot for you.
Imagine you have 9 categories, numbered 1 to 9. Binary-encoding them gives you:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
There you go: you overcome the LabelEncoder ordinality problem, and you get 4 feature columns instead of the 9 that one-hot encoding would need. This is the basic intuition behind the Binary Encoder.
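The third-party category_encoders package ships a BinaryEncoder that does this, but the transformation is simple enough to sketch by hand. Below is a minimal, hypothetical helper (the function name binary_encode and the bit_i column names are my own) that reproduces the table above:

    import numpy as np
    import pandas as pd

    def binary_encode(series):
        # Assign each category a 1-based integer code, as in the table above
        codes = series.astype("category").cat.codes + 1
        # Number of bits needed to represent the largest code
        n_bits = int(np.ceil(np.log2(codes.max() + 1)))
        # Expand each code into its binary digits, most significant bit first
        bits = (codes.to_numpy()[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1
        return pd.DataFrame(bits, index=series.index,
                            columns=[f"bit_{i}" for i in range(n_bits)])

    zips = pd.Series(["10001", "94105", "60601", "10001"])
    print(binary_encode(zips))

With roughly 2000 distinct zip codes, this gives you ceil(log2(2001)) = 11 columns instead of 2000, which a random forest can handle comfortably.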
Upvotes: 2