Reputation: 717
Very conveniently, RandomForest in R accepts factors for the inputs (X). I assume this makes it easier to build a tree, since from a factor variable with values (a, b, c) you can build a node that splits directly into (a, c) and (b). In sklearn I need to encode everything as dummies (0/1), so any relation between the a, b, c vectors is lost.
Is my interpretation correct and is there a way in sklearn to link input vectors?
If I were to encode the variable as (0, 1, 2), I also assume that sklearn would interpret this as 0 and 1 being close to each other, and hence it would seek (e.g.) a split [0, 1] vs [2].
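For concreteness, this is the kind of dummy encoding I mean (a sketch using pandas `get_dummies`; the column name `f` is just for illustration):

```python
# Sketch: one-hot (dummy) encoding of a factor column with pandas,
# as typically done before feeding categorical data to sklearn.
import pandas as pd

df = pd.DataFrame({"f": ["a", "b", "c", "a"]})
dummies = pd.get_dummies(df["f"])           # one 0/1 column per level
print(dummies.columns.tolist())             # ['a', 'b', 'c']
print(dummies.astype(int).values.tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```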
Upvotes: 3
Views: 1401
Reputation: 3196
Scikit-learn, indeed, does not support categorical features without encoding them as numbers. Also, your assumption that sklearn would interpret
as 0 and 1 being close to each other and hence it would seek (e.g.) a split [0, 1] vs [2]
is correct. In some cases this does not necessarily mean that this encoding performs worse than one-hot encoding; you may have to try it on your data.
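As a sketch, such an ordinal encoding can be produced with sklearn's `OrdinalEncoder` (categories are sorted alphabetically by default, so a, b, c map to 0, 1, 2):

```python
# Sketch: ordinal encoding (a, b, c) -> (0, 1, 2) with sklearn.
# The tree then treats the codes as ordered numbers, so any split
# must respect that order.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["a"], ["b"], ["c"], ["b"]])
enc = OrdinalEncoder()
codes = enc.fit_transform(X)
print(codes.ravel())   # [0. 1. 2. 1.]
```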
If you want to stick to python, you have three options:
Upvotes: 2
Reputation: 23637
Consider the factor with three values a, b, c and the corresponding one-hot encoding:
factor    a  b  c
-------   -------
a         1  0  0
b         0  1  0
c         0  0  1
There are three possibilities to split the factor:
f: a | b c
f: b | a c
f: c | a b
There are three dummy variables, each with one possible split. This again results in three possible ways to split:
a: 1 | 0
b: 1 | 0
c: 1 | 0
For example, splitting variable a in 1 | 0 is equivalent to splitting the factor f in a | b c. There is an exact correspondence between a factor and its one-hot encoding. The relation is not lost, and there is no need to explicitly link input vectors.
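A minimal sketch of this correspondence, building the dummies with pandas:

```python
# Sketch: the dummy column `a` already encodes the factor split a | b c.
# Thresholding the dummy at 0.5 gives exactly the same partition as
# asking "is the factor equal to a?".
import pandas as pd

f = pd.Series(["a", "b", "c", "a", "b"])
dummies = pd.get_dummies(f)

left_dummy = dummies["a"] > 0.5   # split on dummy variable a: 1 | 0
left_factor = f == "a"            # split a | b c on the factor

print((left_dummy == left_factor).all())   # True
```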
However, encoding the factor values (a, b, c) as numbers (0, 1, 2) would lose expressive power: there are only two ways to split these numbers, 0 | 1 2 and 0 1 | 2. So a single node could not represent the split b | a c with this encoding.
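This can be checked directly: a numeric split uses a threshold, and the candidate thresholds lie between adjacent sorted values (a sketch):

```python
# Sketch: with codes (0, 1, 2) a tree can only split at thresholds
# between adjacent sorted values, i.e. at 0.5 and 1.5.
import numpy as np

codes = np.array([0, 1, 2])
thresholds = (codes[:-1] + codes[1:]) / 2
print(thresholds)   # [0.5 1.5]
# 0.5 -> {0} | {1, 2}   and   1.5 -> {0, 1} | {2};
# the partition {1} | {0, 2} (i.e. b | a c) is unreachable.
```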
Finally, there is a small catch. When looking for the best split, only max_features features are considered at each node (default: sqrt(n_features)). If the factor itself is included, all of its splits are evaluated. With one-hot encoding it is possible that not all splits of a factor are evaluated, because each dummy variable is selected for inclusion separately. This may have an impact on the resulting trees, but I do not know how severe this effect can be.
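A rough sketch of the arithmetic, assuming the sqrt default for a classifier:

```python
# Sketch: with max_features="sqrt" only ~sqrt(n_features) candidate
# features are examined per split, so a single dummy column (and hence
# the factor split it represents) can be skipped at any given node.
import math

n_features = 3   # dummies a, b, c
per_split = max(1, int(math.sqrt(n_features)))
print(per_split)   # 1 feature considered per split
```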
Upvotes: 2