Reputation: 16503
For which models in scikit-learn do I need to break categorical variables into dummy binary fields?
For example, if the column is political-party, and the values are democrat, republican and green, then for many algorithms you have to break this into three columns where each row can hold only one 1, and all the rest must be 0.
This avoids enforcing an ordinality that doesn't exist when integer-encoding [democrat, republican, green] => [0, 1, 2], since democrat and green aren't actually "farther" away from each other than any other pair.
For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for the algorithms where it isn't, it can't hurt, right?
Upvotes: 2
Views: 3923
Reputation: 40159
For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for the algorithms where it isn't, it can't hurt, right?
All algorithms in sklearn, with the notable exception of tree-based methods, require one-hot encoding (also known as dummy variables) for nominal categorical variables.
Using dummy variables for categorical features with very large cardinalities might hurt tree-based methods, especially randomized tree methods, by introducing a bias in the feature split sampler. Tree-based methods tend to work reasonably well with a basic integer encoding of categorical features.
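A minimal sketch of both encodings with sklearn's built-in transformers (assuming a version >= 0.20, where both encoders accept string inputs directly):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [["democrat"], ["republican"], ["green"], ["democrat"]]

# One-hot encoding: what linear models, SVMs, k-NN, neural nets, etc. need.
# One binary column per party (alphabetical order), a single 1 per row.
one_hot = OneHotEncoder().fit_transform(X).toarray()

# Plain integer encoding: usually good enough for tree-based methods.
# democrat=0.0, green=1.0, republican=2.0
codes = OrdinalEncoder().fit_transform(X)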
Upvotes: 4
Reputation: 21914
For any algorithm that does calculations based on vectorized inputs (the large majority of them, but I'm sure there are exceptions), you will need to do some kind of "vectorization". However, you don't have to do it in the way you've described above.
Since most of the algorithms only care that they are being given a series of numbers mapped to a series of other numbers, you can generally replace the binary fields with confidence levels if you have that level of granularity.
It's also worth noting that these are not "dummy variables" but just a different representation; they directly represent your classes. To answer your last question: it can only hurt if you're throwing away information, so turning a classification into a binary vector is totally fine. To put this into more concrete terms:
['republican'] -> [0, 1, 0] # binary vectorization, totally fine
['republican', 'green'] -> [0, 0.5, 0.5] # non-binary vectorization, also totally fine
{'republican': 0.75, 'green': 0.25} -> [0, 1, 0] # information lost, not fine.
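If you do have confidence levels, sklearn's DictVectorizer handles this directly, since it passes numeric values through unchanged (a minimal sketch; columns come out in alphabetical order):

from sklearn.feature_extraction import DictVectorizer

rows = [
    {"republican": 1.0},                # binary vectorization
    {"republican": 0.5, "green": 0.5},  # confidence levels
]

# Columns in alphabetical order: green, republican; missing keys become 0.
# X == [[0.0, 1.0], [0.5, 0.5]]
X = DictVectorizer(sparse=False).fit_transform(rows)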
Hope that helps, let me know if you have any more questions.
Upvotes: 0