Reputation: 43
I am confused about whether explicit cross validation is necessary for a Random Forest. In a random forest we have out-of-bag (OOB) samples, and these can be used for computing test accuracy. Is explicit cross validation still necessary, and is there any benefit to using CV with a random forest? I also find it hard to understand how CV works in the following code. Here is my code:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
# cart, num_trees, seed, kfold, X and Y are defined earlier in my script
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Upvotes: 3
Views: 1251
Reputation: 4233
"I am confused if explicit cross validation is necessary for Random Forest?"
Answer: no, cross validation is not required for a random forest to work. However, cross validation will show whether your model generalizes well. It is good practice to include cross-validation performance indicators as part of the pipeline and then use the parameters selected by the CV. The downside is that cross validation takes resources and time to complete, and for small datasets the payoff may not be significant.
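For example, here is a minimal sketch (not the asker's exact pipeline) of using cross validation both as a generalization check and to pick parameters, assuming X and y are an existing feature matrix and label vector:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# cross validation scores every parameter combination and keeps the best one
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=cv)
search.fit(X, y)
print(search.best_params_)  # parameters chosen by the CV
print(search.best_score_)   # mean CV accuracy of those parameters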
Upvotes: 0
Reputation: 935
# What is Cross Validation?
Cross Validation is a technique which involves reserving a particular sample of a dataset on which you do not train the model. Later, you test your model on this sample before finalizing it.
Here are the steps involved in cross validation:
1. You reserve a sample of the data set.
2. Train the model using the remaining part of the dataset.
3. Use the reserved sample as the test (validation) set. This will help you gauge the effectiveness of your model's performance. If your model delivers a good result on the validation data, go ahead with the current model. It rocks!
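A minimal sketch of these steps with scikit-learn, assuming X and y are an existing feature matrix and label vector:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. reserve a sample of the dataset
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. train the model on the remaining part
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 3. score the model on the reserved (validation) sample
print(model.score(X_val, y_val))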
Upvotes: 0
Reputation: 161
The oob_score is computed from the bootstrap sampling that happens while training the random forest model; whether to compute it is a parameter you can control. Note that OOB refers to the portion of the dataset that the bootstrap algorithm leaves out of a given tree's training sample. Because bootstrapping samples with replacement, the same record can end up in the training sample of some trees and out of sample for others, a redundancy you do not get with the disjoint train/test splits that k-fold cross validation produces.
Note that one place where OOB can be particularly useful compared to cross validation is when using a Random Forest (or any bagging classifier, for that matter) on a large dataset, where cross validation can become computationally expensive.
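A minimal sketch of the OOB route, assuming X and y already exist; the estimate comes from a single fit, with no extra refits the way k-fold cross validation requires:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42)
model.fit(X, y)
print(model.oob_score_)  # accuracy estimated on the out-of-bag samples, no separate folds needed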
Upvotes: 1
Reputation: 4485
OOB error is an unbiased estimate of the prediction error of the random forest, so reporting the OOB error is sufficient. However, there is no harm in performing cross validation as well; ideally the two estimates should be very close.
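A rough sketch of checking that, assuming X and y are an existing feature matrix and label vector:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42)
forest.fit(X, y)
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=42), X, y, cv=5)
print(forest.oob_score_)  # OOB estimate of accuracy
print(cv_scores.mean())   # 5-fold CV estimate; the two numbers should usually be close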
Upvotes: 0
Reputation: 548
For each row in the dataset, the OOB_score is calculated using only a subset of the trees in the random forest, so it is not truly reflective of how the full random forest model would perform on any particular row. So the short answer is: you still need an explicit validation set, because the score of the model calculated on that set (be it R2, mean squared error, etc.) is based on predictions made by the full model.
I'll illustrate with an (over-simplified) example:
Suppose we're doing regression to predict house sale prices. You have a dataset of 5 rows (each row would contain the features of a particular house, like its YearBuilt, LivingArea and NumberOfGarages, for example), and a random forest with 3 trees.
Tree 1   Tree 2   Tree 3
------   ------   ------
  1        1
  2                 2
           3        3
  4
           5        5
where the numbers 1-5 denote the dataset row number used for training the tree (selected by bootstrapping). Then for example, rows 3 and 5 are not used in training Tree 1, and so forth.
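Under the hood the selection looks roughly like this (a hypothetical numpy sketch, not sklearn's actual sampling code):
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 5, 3
for tree in range(n_trees):
    # draw 5 row numbers with replacement for this tree's training sample
    in_bag = rng.integers(1, n_rows + 1, size=n_rows)
    # whatever was never drawn is this tree's out-of-bag (OOB) set
    oob = np.setdiff1d(np.arange(1, n_rows + 1), in_bag)
    print(f"Tree {tree + 1}: in-bag rows {sorted(set(in_bag.tolist()))}, OOB rows {oob.tolist()}")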
Everything up to here is done regardless of whether you set OOB_score to true or not in sklearn's RandomForest.
If OOB_score is set to true: we go through each row in the dataset, pass it through only the trees whose bootstrap sample did not include it, average those trees' predictions, and score that prediction against the row's true value. The OOB_score is just the average of those scores across all the rows.
Compare this behaviour with what happens if you use an explicit validation set. You would have (for example) 5 new rows in your validation dataset, and for each row you would pass it through ALL 3 trees in the forest, get their individual predictions, and report the average as the final predicted sale price for that row.
Then you could report the mean squared error (or any other metric) on the entire validation set by averaging the errors across all the rows.
To summarize: in calculating the OOB_score, each row is predicted by only a subset of the trees in the forest, whereas the score reported on an explicit validation set comes from predicting each row with ALL the trees in your forest, which is more representative of what actually happens on the test set. And so, this option is the one you want.
On average, you would expect the OOB_score to be slightly worse than the score on an explicit validation set, because fewer trees are used for each prediction in the former.
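A small sketch of that difference on a regression problem, assuming X and y already exist (oob_prediction_ holds the prediction each row gets from only the trees that never saw it, while predict() uses the full forest):
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
model.fit(X, y)
oob_preds = model.oob_prediction_  # each row predicted only by the trees that never saw it
full_preds = model.predict(X)      # each row predicted by all trees (optimistic on training rows)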
That said, the two scores (OOB vs validation) are often quite close in practice. If you have a small dataset and can't afford a validation set, OOB offers a good alternative.
But if your dataset is large enough, I would recommend setting aside an explicit validation set (or doing cross-validation) anyway. In this case, the OOB_score is just an additional metric by which to judge your model, but you should place a higher priority on the validation score, and work on improving that score.
There is also another case in which an explicit validation set is more suitable than using OOB: when there is a time-series involved.
An example is the Kaggle competition called Corporación Favorita Grocery Sales Forecasting, where your goal is to predict grocery prices per item per store over the next 2 weeks based on given data from the past 4 years.
In this case, your model (when it is done) will predict future prices in the test set. Hence, you want to simulate this as far as possible when validating your model. What this means is:
You want to construct the validation set to be as recent as possible (using data from the previous 2 weeks, for example). Your training dataset is then from 4 years ago up until 2 weeks before 'today', and you validate on the "future" validation set, from 2 weeks ago until 'today'.
You cannot do this if you use OOB_score: it generates the pseudo-validation set at random (a consequence of bootstrapping), so the score you get from OOB will be less meaningful, since you are not simulating the "future" effect described above. Generating an explicit validation set gives you the freedom to select your most recent data for validation instead of a random sample.
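For instance, a hypothetical pandas sketch, assuming a DataFrame df with a datetime 'date' column covering the last 4 years:
import pandas as pd

cutoff = df["date"].max() - pd.Timedelta(days=14)
train = df[df["date"] <= cutoff]  # everything up to 2 weeks before "today"
valid = df[df["date"] > cutoff]   # the most recent 2 weeks, used as the validation set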
Upvotes: 3
Reputation: 1009
oob_score is not enough because:
Upvotes: 0