Reputation: 2213
I'm making features for a machine learning model. I'm confused with dummy variable and one-hot encoding.For a instance,a category variable 'week'
range 1-7.When using one-hot encoding, encode week = 1
as 1,000,000,week = 2
is 0,100,000... .But I can also make a dummy variable 'week_v'
,and in this way, I must set a
hidden variable
which means base variable,and feature week_v = 1
is 100,000,week_v = 2
is 010,000... and
does not appear week_v = 7
.So what's the difference between them? I'm using logistic model and then I'll try gbdt.
Upvotes: 14
Views: 17471
Reputation: 3217
Technically 6- a day week is enough to provide a unique mapping for a vocabulary of size 7:
1. Sunday [0,0,0,0,0,0]
2. Monday [1,0,0,0,0,0]
3. Tuesday [0,1,0,0,0,0]
4. Wednesday [0,0,1,0,0,0]
5. Thursday [0,0,0,1,0,0]
6. Friday [0,0,0,0,1,0]
7. Saturday [0,0,0,0,0,1]
dummy coding is a more compact representation, it is preferred in statistical models that perform better when the inputs are linearly independent.
Modern machine learning algorithms, though, don’t require their inputs to be linearly independent and use methods such as L1 regularization to prune redundant inputs. The additional degree of freedom allows the framework to transparently handle a missing input in production as all zeros.
1. Sunday [0,0,0,0,0,0,1]
2. Monday [0,0,0,0,0,1,0]
3. Tuesday [0,0,0,0,1,0,0]
4. Wednesday [0,0,0,1,0,0,0]
5. Thursday [0,0,1,0,0,0,0]
6. Friday [0,1,0,0,0,0,0]
7. Saturday [1,0,0,0,0,0,0]
for missing values : [0,0,0,0,0,0,0]
Upvotes: 4
Reputation: 5210
In fact, there is no difference in the effect of the two approaches (rather wordings) on your regression.
In either case, you have to make sure that one of your dummies is left out (i.e. serves as base assumption) to avoid perfect multicollinearity among the set.
For instance, if you want to take the weekday
of an observation into account, you only use 6 (not 7) dummies assuming the one left out to be the base variable. When using one-hot encoding, your weekday
variable is present as a categorical value in one single column, effectively having the regression use the first of its values as the base.
Upvotes: 11