Reputation: 382
I have a question and need your support. I have a dataset which I am analyzing, and I need to predict a target. To do this I did some data cleaning, among other things dropping highly linearly correlated features.
After preparing my data I applied a random forest regressor (it is a regression problem). I am a bit stuck, since I cannot really grasp the meaning, and thus the right value, of max_features.
I found the following answer, where it is written:
features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n_features) or max_features=n_features.
Can anyone give me a good explanation of how to approach this parameter?
That would be really great
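For context, my correlation-dropping step looked roughly like this (a sketch, not my exact code; the 0.95 threshold and the toy data are just placeholders):

```python
# Sketch: drop one feature from each highly correlated pair.
# The 0.95 cutoff is an assumption; tune it for your data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"
df["c"] = rng.normal(size=100)                            # independent feature

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['b']
```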
Upvotes: 0
Views: 2152
Reputation: 994
max_features is a parameter that needs to be tuned. Values such as sqrt or n/3 are defaults and usually perform decently, but the parameter needs to be optimized for every dataset, as it depends on the features you have, their correlations and their importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use the fitted model's oob_score_ attribute to assess the performance of the forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score_.
For safety, keep n_estimators on the high end.
PS: this procedure is basically a grid-search optimization over one parameter, which is usually done via cross-validation. Since RFs give you OOB scores, you can use those instead of CV scores, as they are quicker to compute.
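The loop described above can be sketched like this (synthetic data via make_regression stands in for your dataset; swap in your own X and y):

```python
# Sketch: tune max_features by looping over all candidate values
# and keeping the one with the best out-of-bag score.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=1.0, random_state=0)

best_score, best_m = -float("inf"), None
for m in range(2, X.shape[1] + 1):
    rf = RandomForestRegressor(
        n_estimators=500,    # keep this high so OOB estimates are stable
        max_features=m,
        oob_score=True,      # exposes the OOB R^2 as rf.oob_score_
        random_state=0,
        n_jobs=-1,
    )
    rf.fit(X, y)
    if rf.oob_score_ > best_score:
        best_score, best_m = rf.oob_score_, m

print(f"best max_features={best_m}, OOB R^2={best_score:.3f}")
```

Note that oob_score_ here is an R^2 value, so higher is better; with very few trees it becomes noisy, which is why n_estimators is kept high.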
Upvotes: 1