Reputation: 717
Very conveniently, RandomForest in R accepts factors for the inputs (X). I assume this makes it easier to build a tree, since from a factor variable with values (a, b, c) you can build a node that splits directly into (a, c) and (b). In sklearn I need to encode everything as dummies (0/1), so any relation between the a, b, c vectors is lost.
Is my interpretation correct and is there a way in sklearn to link input vectors?
If I were to encode the variable as (0, 1, 2), I also assume that sklearn would interpret this as 0 and 1 being close to each other, and hence it would seek (e.g.) a split [0, 1] vs [2].
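For concreteness, this is the kind of dummy encoding I mean (a sketch using pandas `get_dummies`; the column name `f` is just for illustration):

```python
# Sketch: one-hot (dummy) encoding of a factor column with pandas,
# as typically done before feeding categorical data to sklearn.
import pandas as pd

df = pd.DataFrame({"f": ["a", "b", "c", "a"]})
dummies = pd.get_dummies(df["f"])           # one 0/1 column per level
print(dummies.columns.tolist())             # ['a', 'b', 'c']
print(dummies.astype(int).values.tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```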
Upvotes: 3
Views: 1401
Reputation: 3196
Scikit-learn, indeed, does not support categorical features without encoding them as numbers. Also, your assumption that sklearn would interpret
as 0 and 1 being close to each other and hence it would seek (e.g.) a split [0, 1] vs [2]
is correct. In some cases this does not necessarily mean that this encoding performs worse than one-hot encoding; you may have to try it on your data.
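As a sketch, such an ordinal encoding can be produced with sklearn's `OrdinalEncoder` (categories are sorted alphabetically by default, so a, b, c map to 0, 1, 2):

```python
# Sketch: ordinal encoding (a, b, c) -> (0, 1, 2) with sklearn.
# The tree then treats the codes as ordered numbers, so any split
# must respect that order.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["a"], ["b"], ["c"], ["b"]])
enc = OrdinalEncoder()
codes = enc.fit_transform(X)
print(codes.ravel())   # [0. 1. 2. 1.]
```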
If you want to stick to python, you have three options:
Upvotes: 2
Reputation: 23637
Consider the factor with three values a, b, c and the corresponding one-hot encoding:
factor    a  b  c
-------   -------
a         1  0  0
b         0  1  0
c         0  0  1
There are three possibilities to split the factor:
f: a | b c
f: b | a c
f: c | a b
There are three dummy variables, each with one possible split. This again results in three possible ways to split:
a: 1 | 0
b: 1 | 0
c: 1 | 0
For example, splitting variable a in 1 | 0 is equivalent to splitting the factor f in a | b c. There is an exact correspondence between a factor and its one-hot encoding. The relation is not lost, and there is no need to explicitly link input vectors.
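A minimal sketch of this correspondence, building the dummies with pandas:

```python
# Sketch: the dummy column `a` already encodes the factor split a | b c.
# Thresholding the dummy at 0.5 gives exactly the same partition as
# asking "is the factor equal to a?".
import pandas as pd

f = pd.Series(["a", "b", "c", "a", "b"])
dummies = pd.get_dummies(f)

left_dummy = dummies["a"] > 0.5   # split on dummy variable a: 1 | 0
left_factor = f == "a"            # split a | b c on the factor

print((left_dummy == left_factor).all())   # True
```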
However, encoding the factor values (a, b, c) as numbers (0, 1, 2) would lose expressive power: there are only two ways to split these numbers, 0 | 1 2 and 0 1 | 2. So a single node could not represent the split b | a c with this encoding.
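This can be checked directly: a numeric split uses a threshold, and the candidate thresholds lie between adjacent sorted values (a sketch):

```python
# Sketch: with codes (0, 1, 2) a tree can only split at thresholds
# between adjacent sorted values, i.e. at 0.5 and 1.5.
import numpy as np

codes = np.array([0, 1, 2])
thresholds = (codes[:-1] + codes[1:]) / 2
print(thresholds)   # [0.5 1.5]
# 0.5 -> {0} | {1, 2}   and   1.5 -> {0, 1} | {2};
# the partition {1} | {0, 2} (i.e. b | a c) is unreachable.
```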
Finally, there is a small catch. When looking for the best split, only max_features features are considered at each node (default: sqrt(n_features)). If the factor itself is included, all of its splits are evaluated. With one-hot encoding it is possible that not all splits of a factor are evaluated, because each dummy variable is selected for inclusion separately. This may have an impact on the resulting trees, but I do not know how severe this effect can be.
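A rough sketch of the arithmetic, assuming the sqrt default for a classifier:

```python
# Sketch: with max_features="sqrt" only ~sqrt(n_features) candidate
# features are examined per split, so a single dummy column (and hence
# the factor split it represents) can be skipped at any given node.
import math

n_features = 3   # dummies a, b, c
per_split = max(1, int(math.sqrt(n_features)))
print(per_split)   # 1 feature considered per split
```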
Upvotes: 2