Reputation: 21
I am currently working with the Optuna library and I have seen that there is a parameter which allows pruning unpromising trials. It seems that this parameter can only be used with incremental learning methods such as the SGD classifier or with neural networks. Hence, I was wondering: is it possible to prune trials when using a random forest, a CART decision tree or even a logistic regression?
Thanks a lot! :)
PS: I did not find any example on the internet which uses a random forest with pruned trials in Optuna...
Upvotes: 1
Views: 305
Reputation: 5010
SGDClassifier with loss='log_loss' (loss='log' in older scikit-learn versions) performs logistic regression, enabling you to use incremental learning for logistic regression.
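For example, an Optuna objective with pruning for SGDClassifier could look roughly like the sketch below (the dataset, epoch count, hyperparameter range and pruner are only illustrative placeholders):

import numpy as np
import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def objective(trial):
    # loss='log_loss' (loss='log' in older scikit-learn) = logistic regression
    clf = SGDClassifier(
        loss='log_loss',
        alpha=trial.suggest_float('alpha', 1e-6, 1e-1, log=True),
    )
    for epoch in range(20):
        clf.partial_fit(X_train, y_train, classes=np.unique(y))
        score = clf.score(X_val, y_val)
        trial.report(score, epoch)   # report the intermediate value...
        if trial.should_prune():     # ...and let the pruner stop bad trials early
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)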
As for random forests and decision trees: they are batch learners, so trial pruning doesn't apply directly. However, you can wrap a batch learner in a class (PseudoIncrementalBatchLearner below) that refits the learner on more and more of the data each time you call partial_fit(). This is similar to how a learning curve is generated, where the estimator is refitted on increasing portions of the dataset.
In general, as a learner is fit on larger portions of the data, its generalisation error goes down, and the trend is somewhat predictable for an individual estimator. However, when comparing learners, you might like to prune the ones that are relatively slow to improve and too expensive to train on the entire dataset; this is where PseudoIncrementalBatchLearner below could be useful.
The data and example below show how the blue random forest is slow to improve compared to the orange one, which makes the blue one a candidate for early pruning. That saves you from training it on the full dataset (although at the end the two are comparable).
from sklearn import model_selection
from sklearn import base
from sklearn.utils.validation import check_X_y
import numpy as np

# Wraps a batch learner, refitting it on a larger and larger portion of the
# data each time partial_fit() is called.
class PseudoIncrementalBatchLearner(
    base.BaseEstimator,
    base.MetaEstimatorMixin,
    base.ClassifierMixin,
    base.RegressorMixin
):
    def __init__(self, estimator, max_steps=20, random_state=None):
        self.estimator = estimator
        self.max_steps = max_steps
        self.random_state = random_state

    def partial_fit(self, X, y):
        # Record feature names before check_X_y converts X to an ndarray
        if hasattr(X, 'columns'):
            self.feature_names_in_ = np.array(X.columns, dtype='object')
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]

        if not hasattr(self, 'current_step_'):
            self.current_step_ = 0

        # Beyond max_steps there is no more refitting, as the estimator has
        # already been fit on all of the data. Could optionally comment this out.
        if self.current_step_ >= self.max_steps:
            return self

        # Get ShuffleSplit/StratifiedShuffleSplit for a regressor/classifier
        cv = getattr(
            model_selection,
            ('Stratified' if base.is_classifier(self.estimator) else '') + 'ShuffleSplit'
        )

        # Shuffle and split off the portion of the data required for this step
        if self.current_step_ + 1 < self.max_steps:
            train_ix, _ = next(cv(
                n_splits=1,
                train_size=(self.current_step_ + 1) / self.max_steps,
                random_state=self.random_state,
            ).split(X, y))
        else:
            train_ix = np.arange(len(X))

        # Refit the estimator on the current portion of the dataset
        self.estimator_ = base.clone(self.estimator).fit(X[train_ix], y[train_ix])
        self.current_step_ += 1
        return self

    def predict(self, X):
        return self.estimator_.predict(X)
#
#Make test dataset
#
from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_val, y_val = make_moons(n_samples=200, noise=0.2, random_state=1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.gcf().set_size_inches(6, 3)
plt.show()
#Create two classifiers to see which learns in fewer steps.
from sklearn.ensemble import RandomForestClassifier
rf0 = RandomForestClassifier(n_estimators=10, random_state=np.random.RandomState(0))
rf1 = RandomForestClassifier(n_estimators=100, random_state=np.random.RandomState(1))
pi_rf0 = PseudoIncrementalBatchLearner(rf0, random_state=np.random.RandomState(0))
pi_rf1 = PseudoIncrementalBatchLearner(rf1, random_state=np.random.RandomState(1))
#Run pseudo-incremental training (training on larger portions of same data, each step)
val_scores0, val_scores1 = [], []
for i in range(pi_rf0.max_steps):
    pi_rf0.partial_fit(X, y)
    pi_rf1.partial_fit(X, y)
    val_scores0.append(pi_rf0.score(X_val, y_val))
    val_scores1.append(pi_rf1.score(X_val, y_val))
#Plot results
plt.plot(val_scores0, lw=2, label='rf0 validation')
plt.plot(val_scores1, lw=2, label='rf1 validation')
plt.xlabel('training "step" (i.e. proportion of the training data)')
plt.gca().set_xticks(range(pi_rf0.max_steps))
plt.ylabel('accuracy')
plt.gcf().set_size_inches(8, 2.5)
plt.gcf().legend()
plt.show()
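To connect this back to Optuna: once wrapped, a random forest can be used inside an objective with pruning just like a real incremental learner. A rough sketch (the search space and pruner below are only placeholders), reusing X, y, X_val, y_val and the class defined above:

import optuna

def objective(trial):
    rf = RandomForestClassifier(
        n_estimators=trial.suggest_int('n_estimators', 10, 200),
        random_state=0,
    )
    pi_rf = PseudoIncrementalBatchLearner(rf)
    for step in range(pi_rf.max_steps):
        pi_rf.partial_fit(X, y)                  # refit on a larger portion each step
        score = pi_rf.score(X_val, y_val)
        trial.report(score, step)                # intermediate value for the pruner
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=10)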
Upvotes: 0