aurfa

Reputation: 21

Optuna pruned trial for random forest classifier

I am currently working with the Optuna library and I have seen that there is a feature which allows pruning unpromising trials. It seems that this feature can only be used with incremental learning methods such as the SGD classifier or with neural networks. Hence, I was wondering: is it possible to prune trials when using a random forest, a CART decision tree, or even a logistic regression?

Thanks a lot! :)

PS: I did not find any example on the internet that uses a random forest with pruned trials in Optuna ...

Upvotes: 1

Views: 305

Answers (1)

MuhammedYunus

Reputation: 5010

SGDClassifier with loss='log_loss' performs logistic regression, enabling you to use incremental learning (and therefore trial pruning) for logistic regression.
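For reference, here is a minimal sketch of incremental logistic regression via partial_fit() (the dataset and the number of passes are illustrative, not from the original answer); each partial_fit() call is a natural point to report an intermediate score to an Optuna trial:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)

#loss='log_loss' makes SGDClassifier an incrementally trained logistic regression
clf = SGDClassifier(loss='log_loss', random_state=0)

classes = np.unique(y_demo)  #required on the first call to partial_fit()
for epoch in range(5):
    clf.partial_fit(X_demo, y_demo, classes=classes)
    #each pass is a point where an intermediate score could be reported for pruning

print(clf.score(X_demo, y_demo))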

As for random forests and decision trees: they are batch learners, so trial pruning doesn't apply out of the box. However, you can wrap a batch learner in a class (PseudoIncrementalBatchLearner below) that refits the learner on more and more of the data each time you call partial_fit(). This is similar to how a learning curve is generated, where the estimator is refitted on increasingly large portions of the dataset.

In general, as a learner is fit on larger portions of the data, its generalisation error goes down, and the trend is somewhat predictable for an individual estimator. When comparing learners, however, you might want to prune ones that are relatively slow to improve and too expensive to train on the entire dataset; this is where PseudoIncrementalBatchLearner below could be useful.

The data and example below show how the blue random forest is slow to improve compared to the orange one, making the blue one a candidate for early pruning. This saves you from having to train it on the full dataset (although by the end the two are comparable).

[Figure: validation accuracy per training step for the two random forests]

from sklearn import model_selection
from sklearn import base
from sklearn.utils import check_X_y
import numpy as np

#
#Wraps a batch learner, training it on larger portions of the data
# each time partial_fit() is called
#
class PseudoIncrementalBatchLearner(
    base.BaseEstimator,
    base.MetaEstimatorMixin,
    base.ClassifierMixin,
    base.RegressorMixin
):
    def __init__(self, estimator, max_steps=20, random_state=None):
        self.estimator = estimator
        self.max_steps = max_steps
        self.random_state = random_state
    
    def partial_fit(self, X, y):
        #Capture feature names before validation converts X to an ndarray
        if hasattr(X, 'columns'):
            self.feature_names_in_ = np.array(
                X.columns, dtype='object'
            )
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        
        if not hasattr(self, 'current_step_'):
            self.current_step_ = 0
        
        #Get ShuffleSplit/StratifiedShuffleSplit for regressor/classifier
        cv = getattr(
            model_selection,
            ('Stratified' if base.is_classifier(self.estimator) else '') + 'ShuffleSplit'
        )
        
        #Shuffle and split off the required portion of the data for this step
        if self.current_step_ + 1 < self.max_steps:
            train_ix, _ = next(cv(
                n_splits=1,
                train_size=(self.current_step_ + 1) / self.max_steps,
                random_state=self.random_state
            ).split(X, y))
        else:
            train_ix = np.arange(len(X))

        #Beyond max_steps, no more refitting, as already fit on all data.
        # Could optionally comment this part out.
        if self.current_step_ + 1 > self.max_steps:
            return self
        
        #Refit estimator on the current portion of the dataset
        self.estimator_ = base.clone(self.estimator).fit(X[train_ix], y[train_ix])
        self.current_step_ += 1
        return self
    
    def predict(self, X):
        return self.estimator_.predict(X)


#
#Make test dataset
#
from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_val, y_val = make_moons(n_samples=200, noise=0.2, random_state=1)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.gcf().set_size_inches(6, 3)
plt.show()

#Create two classifiers to see which learns in fewer steps.
from sklearn.ensemble import RandomForestClassifier
rf0 = RandomForestClassifier(n_estimators=10, random_state=np.random.RandomState(0))
rf1 = RandomForestClassifier(n_estimators=100, random_state=np.random.RandomState(1))

pi_rf0 = PseudoIncrementalBatchLearner(rf0, random_state=np.random.RandomState(0))
pi_rf1 = PseudoIncrementalBatchLearner(rf1, random_state=np.random.RandomState(1))

#Run pseudo-incremental training (training on larger portions of same data, each step)
val_scores0, val_scores1 = [], []
for i in range(pi_rf0.max_steps):
    pi_rf0.partial_fit(X, y)
    pi_rf1.partial_fit(X, y)
    
    val_scores0.append(pi_rf0.score(X_val, y_val))
    val_scores1.append(pi_rf1.score(X_val, y_val))

#Plot results
plt.plot(val_scores0, lw=2, label='rf0 validation')
plt.plot(val_scores1, lw=2, label='rf1 validation')
plt.xlabel('training "step" (i.e. proportion of the training data)')
plt.gca().set_xticks(range(pi_rf0.max_steps))
plt.ylabel('accuracy')
plt.gcf().set_size_inches(8, 2.5)
plt.gcf().legend()
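
To tie this back to Optuna, a pruned objective built on PseudoIncrementalBatchLearner could look roughly like the sketch below. The hyperparameter ranges, the MedianPruner choice, and the reuse of X, y, X_val, y_val from above are assumptions for illustration, not part of the original code:

import optuna

def objective(trial):
    rf = RandomForestClassifier(
        n_estimators=trial.suggest_int('n_estimators', 10, 200),
        max_depth=trial.suggest_int('max_depth', 2, 16),
        random_state=0
    )
    model = PseudoIncrementalBatchLearner(rf)

    for step in range(model.max_steps):
        model.partial_fit(X, y)            #refit on a larger portion of the data
        score = model.score(X_val, y_val)  #accuracy on the held-out set

        trial.report(score, step)          #report intermediate value to the pruner
        if trial.should_prune():           #stop unpromising trials early
            raise optuna.TrialPruned()

    return score

study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)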

Upvotes: 0
