Reputation: 382
I have a question and need your support. I have a dataset which I am analyzing, and I need to predict a target. To do this I did some data cleaning, among other things dropping highly linearly correlated features.
After preparing my data I applied a random forest regressor (it is a regression problem). I am a bit stuck, since I cannot really grasp the meaning, and thus the right value, of max_features.
I found the following answer, where it is written:
features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n_features) or max_features=n_features.
Can anyone give me a good explanation of how to approach this parameter?
That would be really great
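For context, my correlation-dropping step looked roughly like this (a sketch, not my exact code; the 0.95 threshold and the toy data are just placeholders):

```python
# Sketch: drop one feature from each highly correlated pair.
# The 0.95 cutoff is an assumption; tune it for your data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"
df["c"] = rng.normal(size=100)                            # independent feature

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['b']
```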
Upvotes: 0
Views: 2152
Reputation: 994
max_features is a parameter that needs to be tuned. Values such as sqrt or n/3 are defaults and usually perform decently, but the parameter needs to be optimized for every dataset, as it depends on the features you have, their correlations and their importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use the fitted model's oob_score_ attribute to assess the performance of the forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score_.
For safety, keep n_estimators on the high end.
PS: this procedure is basically a grid-search optimization over one parameter, which is usually done via cross-validation. Since RFs give you OOB scores, you can use those instead of CV scores, as they are quicker to compute.
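The loop described above can be sketched like this (synthetic data via make_regression stands in for your dataset; swap in your own X and y):

```python
# Sketch: tune max_features by looping over all candidate values
# and keeping the one with the best out-of-bag score.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=1.0, random_state=0)

best_score, best_m = -float("inf"), None
for m in range(2, X.shape[1] + 1):
    rf = RandomForestRegressor(
        n_estimators=500,    # keep this high so OOB estimates are stable
        max_features=m,
        oob_score=True,      # exposes the OOB R^2 as rf.oob_score_
        random_state=0,
        n_jobs=-1,
    )
    rf.fit(X, y)
    if rf.oob_score_ > best_score:
        best_score, best_m = rf.oob_score_, m

print(f"best max_features={best_m}, OOB R^2={best_score:.3f}")
```

Note that oob_score_ here is an R^2 value, so higher is better; with very few trees it becomes noisy, which is why n_estimators is kept high.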
Upvotes: 1