Peng He

Reputation: 2213

What's the difference between dummy variable and one-hot encoding?

I'm building features for a machine learning model, and I'm confused about the difference between dummy variables and one-hot encoding. For instance, take a categorical variable 'week' with values 1-7. With one-hot encoding, week = 1 is encoded as 1000000, week = 2 as 0100000, and so on. But I can also make a dummy variable 'week_v'. In that case I must pick one value as a hidden base variable, so week_v = 1 is 100000, week_v = 2 is 010000, and so on, and week_v = 7 does not appear. So what's the difference between them? I'm using a logistic model, and then I'll try GBDT.

Upvotes: 14

Views: 17471

Answers (2)

Ravi

Reputation: 3217

Technically, 6 columns are enough to provide a unique mapping for the 7 days of the week (a vocabulary of size 7):

 1. Sunday    [0,0,0,0,0,0]
 2. Monday    [1,0,0,0,0,0]
 3. Tuesday   [0,1,0,0,0,0]
 4. Wednesday [0,0,1,0,0,0]
 5. Thursday  [0,0,0,1,0,0]
 6. Friday    [0,0,0,0,1,0]
 7. Saturday  [0,0,0,0,0,1]

Dummy coding is a more compact representation. It is preferred in statistical models that perform better when the inputs are linearly independent.

Modern machine learning algorithms, though, don’t require their inputs to be linearly independent and use methods such as L1 regularization to prune redundant inputs. With one-hot encoding, the additional degree of freedom allows the framework to transparently handle a missing input in production as all zeros:

 1. Sunday    [0,0,0,0,0,0,1]
 2. Monday    [0,0,0,0,0,1,0]
 3. Tuesday   [0,0,0,0,1,0,0]
 4. Wednesday [0,0,0,1,0,0,0]
 5. Thursday  [0,0,1,0,0,0,0]
 6. Friday    [0,1,0,0,0,0,0]
 7. Saturday  [1,0,0,0,0,0,0]

 Missing values: [0,0,0,0,0,0,0]
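
A minimal sketch of the two encodings in plain Python may make the difference easier to see. The helper names `one_hot` and `dummy` are hypothetical, Sunday is assumed as the baseline for dummy coding (as in the first list), and the column order is simplified so that position i marks the i-th day rather than mirroring the reversed layout of the second list.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

def one_hot(day):
    """7 columns, one per day; an unknown or missing day maps to all zeros."""
    return [1 if day == d else 0 for d in DAYS]

def dummy(day, baseline="Sunday"):
    """6 columns: every day except the baseline, which maps to all zeros."""
    levels = [d for d in DAYS if d != baseline]
    return [1 if day == d else 0 for d in levels]

print(one_hot("Monday"))  # [0, 1, 0, 0, 0, 0, 0]
print(one_hot(None))      # [0, 0, 0, 0, 0, 0, 0]
print(dummy("Monday"))    # [1, 0, 0, 0, 0, 0]
print(dummy("Sunday"))    # [0, 0, 0, 0, 0, 0]
print(dummy(None))        # [0, 0, 0, 0, 0, 0]
```

Note that under dummy coding a missing value is indistinguishable from the baseline day, while under one-hot encoding the all-zeros row is a pattern that no real day uses.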

Upvotes: 4

jbndlr

Reputation: 5210

In fact, the two approaches (or rather, the two terms) have the same effect on your regression.

In either case, you have to make sure that one of your dummies is left out (i.e. serves as base assumption) to avoid perfect multicollinearity among the set.

For instance, if you want to take the weekday of an observation into account, you use only 6 (not 7) dummies, with the one left out serving as the base variable. When using one-hot encoding, your weekday variable is present as a categorical value in a single column, effectively having the regression use the first of its values as the base.
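
To see the multicollinearity concretely, here is a small sketch in plain Python, assuming the regression includes an intercept column. It builds a design matrix with one row per weekday and computes its rank by exact Gaussian elimination. The helper names `rank` and `design` are hypothetical. With all 7 indicators, the indicator columns sum to the intercept column, so the matrix has 8 columns but only rank 7; with 6 indicators it has 7 columns and full rank 7.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix via Gaussian elimination with exact arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0  # number of pivot rows found so far
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def design(n_indicators):
    """One row per weekday 0..6: intercept, then the first n indicator columns."""
    return [[1] + [1 if day == k else 0 for k in range(n_indicators)]
            for day in range(7)]

full = design(7)     # intercept + 7 indicators = 8 columns
reduced = design(6)  # intercept + 6 indicators = 7 columns, day 6 is the base

print(len(full[0]), rank(full))        # 8 7
print(len(reduced[0]), rank(reduced))  # 7 7
```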

Upvotes: 11
