Abhishek Prasar

Reputation: 117

One-hot encoding in random forest classifier

Is one-hot encoding necessary for a random forest classifier in Python? I want to understand, logically, whether a random forest can handle categorical features with label encoding rather than one-hot encoding.

Upvotes: 5

Views: 17557

Answers (2)

Vibhav Sharma

Reputation: 80

Random forest is built on decision trees, which are sensitive to one-hot encoding. Sensitive here means that if we apply one-hot encoding before fitting a decision tree, the splits can produce a sparse tree. The trees tend to grow in one direction because every split on a dummy variable has only two possible values (0 or 1), and the tree grows in the direction of the zeroes in the dummy variables.

[Figure: decision trees with and without one-hot encoding]

Now you must be wondering how to handle categorical values without one-hot encoding. For that you can refer to the Hashing Trick. You can also look into h2o Random Forest.
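To make the hashing trick a bit more concrete, here is a minimal sketch using scikit-learn's FeatureHasher. The color values and the choice of 8 output columns are assumptions made up for illustration; they are not from the linked article.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical categorical column (illustrative values only).
colors = ["red", "yellow", "orange", "red"]

# Hash each category into a fixed number of columns.
# n_features=8 is an arbitrary choice for this sketch; fewer columns
# means more collisions between different categories.
hasher = FeatureHasher(n_features=8, input_type="string")

# With input_type="string", each sample must be an iterable of strings,
# so every value is wrapped in its own list.
hashed = hasher.transform([[c] for c in colors])

# The result is a SciPy sparse matrix; densify it only to inspect it.
dense = hashed.toarray()
print(dense.shape)
print(dense)
```

The number of columns stays fixed no matter how many distinct categories show up, at the cost of possible collisions between categories.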

Upvotes: 5

Vedant Vasishtha

Reputation: 395

Encoding is necessary in machine learning because it converts non-numeric features into numeric ones that a model can work with.

Any type of encoding can be applied to any non-numeric feature; which one you pick depends on your intuition about the feature.

Now, coming to your question of when to use label encoding and when to use one-hot encoding:

  1. Use label encoding when you want to preserve the ordinal nature of your feature. For example, you have an education level feature with string values like "Bachelor", "Master", "Ph.D". In this case you want to preserve the order Ph.D > Master > Bachelor, so you map the values with label encoding: Bachelor-1, Master-2, Ph.D-3.
  2. Use one-hot encoding when you want to treat the values of your categorical variable as unordered. For example, you have a colors variable with values "red", "yellow", "orange". Here no value has precedence over the others, so you use one-hot encoding.
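The two cases above can be sketched roughly like this with pandas and scikit-learn. The tiny DataFrame is made up for illustration, and OrdinalEncoder with an explicit category list is just one way to do it (chosen here so the order is stated in the code instead of being left to alphabetical sorting).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data matching the two examples in the list above.
df = pd.DataFrame({
    "education": ["Bachelor", "Ph.D", "Master", "Bachelor"],
    "color": ["red", "yellow", "orange", "red"],
})

# Case 1: ordinal feature. Spell out the order explicitly so that
# Bachelor < Master < Ph.D is preserved in the numeric codes.
encoder = OrdinalEncoder(categories=[["Bachelor", "Master", "Ph.D"]])
codes = encoder.fit_transform(df[["education"]]).ravel().astype(int)
df["education_code"] = codes + 1  # shift 0,1,2 to 1,2,3 as in the text

# Case 2: unordered feature. One 0/1 column per color.
color_dummies = pd.get_dummies(df["color"], prefix="color")

print(df[["education", "education_code"]])
print(color_dummies)
```

Note that the one-hot version adds one column per distinct color, while the ordinal version keeps a single column.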

NOTE: With one-hot encoding the number of features increases, which is not good for tree-based algorithms like decision trees and random forest. That's why label encoding is mostly preferred in this case. If you still use one-hot encoding, you can check the importance of the categorical features through the feature_importances_ attribute of the fitted model in sklearn (it is an attribute available after fitting, not a hyperparameter). If a feature has low importance, you can drop it.
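As a rough sketch of that importance check: the data below is synthetic and invented purely for illustration (the target depends only on whether the color is "red"), and summing the dummy columns back into one number per original feature is my own convenience, not something sklearn does for you.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic, made-up data: one categorical feature and one noise feature.
rng = np.random.default_rng(0)
n = 300
color = rng.choice(["red", "yellow", "orange"], size=n)
noise = rng.normal(size=n)
y = (color == "red").astype(int)  # target depends only on color

# One-hot encode the categorical column; "noise" is left as is.
X = pd.get_dummies(pd.DataFrame({"color": color, "noise": noise}),
                   columns=["color"])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ exists only after fit(); one value per column.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Add up the dummy columns to get one importance for the original feature.
color_cols = [c for c in X.columns if c.startswith("color_")]
color_total = importances[color_cols].sum()
print("color (all dummies):", color_total)
```

With one-hot encoding the importance of a single categorical feature is spread over its dummy columns, so it is worth looking at the sum before dropping anything.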

Upvotes: 8
