Bhargavi N

Reputation: 43

Random forest sklearn

Is explicit cross validation necessary for Random Forest? In a random forest we have out-of-bag (OOB) samples, and these can be used for computing test accuracy, so is there any benefit to explicitly using CV with a random forest? I also find it hard to understand how CV works with a random forest from this code. Here is my code:

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Upvotes: 3

Views: 1251

Answers (6)

"I am confused if explicit cross validation is necessary for Random Forest?"

Answer: no, cross validation is not required for a random forest to work. However, cross validation will show whether your model generalizes well. It is good practice to include cross validation performance indicators as part of the pipeline and then use the parameters chosen during CV. The downside is that cross validation takes resources and time to complete, and for small datasets the payoff may not be significant.
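For example, a minimal sketch (using a built-in toy dataset rather than the asker's data) of cross-validating a RandomForestClassifier with sklearn's current model_selection API:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)  # the forest is refit once per fold
print(scores.mean(), scores.std())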

Upvotes: 0

CHAVDA MEET

Reputation: 935

What is Cross Validation?

Cross Validation is a technique which involves reserving a particular sample of a dataset on which you do not train the model. Later, you test your model on this sample before finalizing it.

Here are the steps involved in cross validation:

 1. You reserve a sample data set.
 2. Train the model using the remaining part of the dataset.
 3. Use the reserved sample as the test (validation) set. This will help you gauge
    the effectiveness of your model's performance. If your model delivers good
    results on the validation data, go ahead with the current model. A minimal
    sketch of these steps is shown below.
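A minimal sketch of the three steps, assuming a simple hold-out split (train_test_split on a built-in toy dataset) rather than full k-fold cross validation:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# 1. reserve a sample of the data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
# 2. train the model on the remaining part
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# 3. score on the reserved (validation) sample
print(model.score(X_valid, y_valid))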

Upvotes: 0

Vivek

Reputation: 161

oob_score is calculated when you do bootstrap sampling of your dataset while training the random forest model; it is controlled by a parameter you can set. Note that OOB refers to the portion of the dataset that the bootstrap algorithm left out for a given tree. Because bootstrapping samples with replacement, the same records can end up inside one tree's training sample and out of sample for another, as opposed to the clean train/test splits that k-fold cross validation would produce.

Do note that one place where OOB can be particularly useful over cross validation is when using Random Forest, or any bagging classifier for that matter, on a large dataset, where cross validation can become computationally expensive.
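For example, a minimal sketch (on synthetic data) of turning on oob_score so that a single fit yields a validation-style estimate, instead of refitting the forest once per cross-validation fold:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # accuracy estimated from the out-of-bag samples of a single fit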

Upvotes: 1

prashanth

Reputation: 4485

OOB error is an unbiased estimate of the prediction error of the random forest, so reporting OOB error is sufficient. However, there is no harm in performing cross validation; ideally both should be very close.
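As a rough check of that claim (on a built-in toy dataset), you can compare the OOB estimate with a cross-validated score from the same data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5)
print("OOB estimate:", forest.oob_score_)
print("5-fold CV   :", cv_scores.mean())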

Upvotes: 0

Troy

Reputation: 548

For each row in the dataset, the OOB_score is calculated using only a subset of the trees in the random forest, so it is not truly reflective of how the full random forest model would perform on any particular data row. The short answer, then, is: you still need an explicit validation set, because the score calculated on that set (be it R2, mean squared error, etc.) is based on predictions made by the full model.

I'll illustrate with an (over-simplified) example:

Suppose we're doing regression for predicting house sale prices. You have a dataset of 5 rows (each row would contain the features of a particular house like its YearBuilt, LivingArea, NumberOfGarages for example), and a random forest with 3 trees.

Tree 1     Tree 2     Tree 3
------     ------     ------
  1                      1
  2           2       
              3          3
  4                      
              5          5

where the numbers 1-5 denote the dataset row number used for training the tree (selected by bootstrapping). Then for example, rows 3 and 5 are not used in training Tree 1, and so forth. Everything up to here is done regardless of whether you set OOB_score true or not in sklearn's RandomForest.

OOB

If OOB_score is set to true: we go through each row in the dataset, and do the following.

  • Row 1: only Tree 2 did not use it for fitting/training. Hence, we predict and get the score for row 1 using only Tree 2.
  • Row 2: only Tree 3 did not use it for training. Hence, we predict and get the score for row 2 using only Tree 3.
  • ...
  • Row 4: Trees 2 and 3 did not use it for training. The predicted sale price for this house will be the average of the predictions by Tree 2 and Tree 3, but not Tree 1.

The OOB_score is just the average of the scores of the predictions across all the rows.

Validation set

Compare this behaviour with what would happen if you had used an explicit validation set. You would have (for example) 5 new rows in your validation dataset, and for each row you would pass it through ALL 3 trees in the forest, get their individual predictions, and report the average as the final prediction of the sale price for that row.

Then you could report the mean squared error (or any other metric) on the entire validation set by taking the averages of the errors across all the rows.
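To make the contrast concrete, here is a rough regression sketch (on synthetic data, not the house-price example above): oob_prediction_ for each training row comes only from trees that did not see that row, while predictions on an explicit validation set come from the full forest.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

reg = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0).fit(X_train, y_train)
oob_mse = mean_squared_error(y_train, reg.oob_prediction_)     # each row predicted by a subset of trees
valid_mse = mean_squared_error(y_valid, reg.predict(X_valid))  # each row predicted by the full forest
print(oob_mse, valid_mse)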

Summary

To summarize,

  • in calculating the OOB_score, each row is only predicted on a subset of trees in the forest.

  • Whereas the score reported on an explicit validation set is the result of predicting each row with all the trees in your forest, which is more representative of what actually happens on the test set. And so, this option is the one you want.

  • On average, you would expect the OOB_score to be slightly worse than the score on an explicit validation set, because fewer trees are used to make each prediction in the former.

Comment

That said, the two scores (OOB vs validation) are often quite close in practice. If you have a small dataset and can't afford a validation set, OOB offers a good alternative.

But if your dataset is large enough, I would recommend setting aside an explicit validation set (or doing cross-validation) anyway. In this case, the OOB_score is just an additional metric by which to judge your model, but you should place a higher priority on the validation score and work on improving that score.


A second reason

There is also another case in which an explicit validation set is more suitable than using OOB: when there is a time-series involved.

An example is the Kaggle competition Corporación Favorita Grocery Sales Forecasting, where the goal is to predict grocery sales per item per store over the next 2 weeks based on data from the past 4 years.

In this case, your model (when it is done) will predict future sales in the test set. Hence, you want to simulate this as closely as possible when validating your model. What this means is:

You want to construct the validation set to be as recent as possible (using data from the most recent 2 weeks, for example). Then your training dataset runs from 4 years ago until 2 weeks before 'today', and you validate on the "future" validation set, from 2 weeks ago until 'today'.

You cannot do this if you use OOB_score: it generates the pseudo-validation set at random (a consequence of bootstrapping), so the score you get from OOB will be less meaningful since you are not simulating the "future" effect described above. An explicit validation set gives you the freedom to select the most recent data you have for validation, instead of a random sample.
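A hypothetical sketch of that time-based split (the frame, its "date" column, and the date range are stand-ins, not the competition's actual schema):

import pandas as pd

# stand-in frame for the competition data; in practice it would also hold features and the target
df = pd.DataFrame({"date": pd.date_range("2013-01-01", "2017-08-15", freq="D")})

cutoff = df["date"].max() - pd.Timedelta(weeks=2)
train = df[df["date"] <= cutoff]  # from ~4 years ago up to two weeks before 'today'
valid = df[df["date"] > cutoff]   # the most recent two weeks, standing in for the "future"
# fit the forest on `train` and score it on `valid`, instead of relying on oob_score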

Upvotes: 3

andrewchauzov

Reputation: 1009

oob_score is not enough because:

  1. it is calculated on rows from the train set, which does not demonstrate predictive power on genuinely new data
  2. I do not see any cv/stratification inside the oob_score calculation (so if you have an imbalanced dataset and the OOB rows are taken completely at random, that is bad)
  3. it uses accuracy_score for classification and r2 for regression, which may not be your desired metrics (see the sketch after this list)
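A minimal sketch addressing points 2 and 3: explicit cross validation lets you pick a stratified splitter and your own scoring metric, which oob_score does not (synthetic imbalanced data assumed).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)  # imbalanced classes
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=skf, scoring="f1")
print(scores.mean())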

Upvotes: 0
