Reputation: 5527
How is it possible to control the size of the subsample used for the training of each tree in the forest? According to the documentation of scikit-learn:
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
So bootstrap allows randomness, but I can't find how to control the size of the subsample.
Upvotes: 4
Views: 3838
Reputation: 11
In version 0.22, scikit-learn added the max_samples parameter, which can be tuned; from the docs:
The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
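For instance, each tree below is trained on a bootstrap sample of half the rows (a minimal sketch; the synthetic data from make_classification is just for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# max_samples=0.5 draws 50% of the rows (with replacement) per tree;
# passing an int instead of a float gives an absolute row count.
clf = RandomForestClassifier(n_estimators=100, max_samples=0.5,
                             bootstrap=True, random_state=0)
clf.fit(X, y)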
Upvotes: 1
Reputation: 26
You can actually monkey-patch the _generate_sample_indices function in forest.py to change the subsample size each time. The fastai library has implemented a function set_rf_samples for exactly that purpose (note that this relies on a private API; in scikit-learn 0.22 the module was renamed to _forest and the helper's signature changed). It looks like this:
from sklearn.ensemble import forest

def set_rf_samples(n):
    """Change scikit-learn's random forests to give each tree a random
    sample of n random rows."""
    # Monkey-patch the private helper that draws bootstrap indices so
    # every tree samples exactly n rows (with replacement).
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))
You could add this function to your own code and call it before fitting a forest.
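A usage sketch (X and y stand in for your own training data; this assumes a pre-0.22 scikit-learn, where the sklearn.ensemble.forest module path still exists):

from sklearn.ensemble import RandomForestClassifier

set_rf_samples(500)                    # every tree now draws 500 rows
model = RandomForestClassifier(n_estimators=40)
model.fit(X, y)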
Upvotes: 1
Reputation: 8528
Scikit-learn doesn't provide this, but you can easily get the same behaviour with a (slower) combination of a decision tree and the bagging meta-estimator:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# max_samples=0.5: each tree is trained on a random 50% of the rows
clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=0.5)
As a side note, Breiman's random forest indeed doesn't treat the subsample size as a parameter and relies entirely on the bootstrap, so approximately a fraction 1 - 1/e ≈ 0.632 of the distinct samples is used to build each tree.
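A quick numerical sketch of that figure: the expected fraction of distinct rows in a bootstrap of size n is 1 - (1 - 1/n)^n, which tends to 1 - 1/e as n grows.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
idx = rng.integers(0, n, size=n)    # one bootstrap sample of size n
print(len(np.unique(idx)) / n)      # empirical fraction of distinct rows, ~0.632
print(1 - (1 - 1/n) ** n)           # analytic value for finite n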
Upvotes: 3