DanielTheRocketMan

Reputation: 3249

Early stopping in Sklearn GradientBoostingRegressor

I am using a monitor class, as implemented here:

# Note: this relies on private scikit-learn internals (predict_stage,
# clf._init_decision_function) that existed around scikit-learn 0.18-0.20;
# they have since been renamed or removed in later versions.
from sklearn.ensemble._gradient_boosting import predict_stage
from sklearn import ensemble


class Monitor:

    """Monitor for early stopping in Gradient Boosting.

    The monitor checks the validation loss between each training stage. When
    too many successive stages have increased the loss, the monitor returns
    True, stopping the training early.

    Parameters
    ----------
    X_valid : array-like, shape = [n_samples, n_features]
      Validation vectors, where n_samples is the number of samples
      and n_features is the number of features.
    y_valid : array-like, shape = [n_samples]
      Target values (integers in classification, real numbers in
      regression). For classification, labels must correspond to classes.
    max_consecutive_decreases : int, optional (default=5)
      Early stopping criterion: when the number of consecutive iterations
      that worsen performance on the validation set reaches this value,
      training stops.
    """

    def __init__(self, X_valid, y_valid, max_consecutive_decreases=5):
        self.X_valid = X_valid
        self.y_valid = y_valid
        self.max_consecutive_decreases = max_consecutive_decreases
        self.losses = []

    def __call__(self, i, clf, args):
        # On the first stage, reset the counter and initialise the raw
        # predictions for the validation set.
        if i == 0:
            self.consecutive_decreases_ = 0
            self.predictions = clf._init_decision_function(self.X_valid)

        # Add the current stage's contribution to the validation predictions
        # and record the resulting validation loss.
        predict_stage(clf.estimators_, i, self.X_valid, clf.learning_rate,
                      self.predictions)
        self.losses.append(clf.loss_(self.y_valid, self.predictions))

        # Count how many successive stages have increased the loss.
        if len(self.losses) >= 2 and self.losses[-1] > self.losses[-2]:
            self.consecutive_decreases_ += 1
        else:
            self.consecutive_decreases_ = 0

        # Returning True tells scikit-learn to stop training.
        if self.consecutive_decreases_ >= self.max_consecutive_decreases:
            print("Stopping early at stage {}: validation loss increased "
                  "for {} consecutive stages.".format(
                      i, self.consecutive_decreases_))
            return True
        else:
            return False

params = { 'n_estimators':             nEstimators,
           'max_depth':                maxDepth,
           'min_samples_split':        minSamplesSplit,
           'min_samples_leaf':         minSamplesLeaf,
           'min_weight_fraction_leaf': minWeightFractionLeaf,
           'min_impurity_decrease':    minImpurityDecrease,
           'learning_rate':            0.01,
           'loss':                    'quantile',
           'alpha':                    alpha,
           'verbose':                  0
           }
model = ensemble.GradientBoostingRegressor(**params)
model.fit(XTrain, yTrain, monitor=Monitor(XTest, yTest, 25))

It works very well. However, it is not clear to me which model this line

model.fit(XTrain, yTrain, monitor=Monitor(XTest, yTest, 25))

returns:

1) No model

2) The model trained before stopping

3) The model from 25 iterations before stopping (note the monitor's parameter)

If it is not (3), is it possible to make the estimator return (3)?

How can I do that?

It is worth mentioning that the xgboost library does this; however, it does not allow me to use the loss function that I need.

Upvotes: 1

Views: 804

Answers (1)

Yaron

Reputation: 1852

The fit returns the model as trained up to the point where the "stopping rule" fired, which means your answer No. 2 is the right one.
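
You can check this yourself (a minimal sketch, assuming `model` was fitted with the Monitor as in your question): scikit-learn trims the fitted estimators_ array to the stages that were actually trained, so its first dimension shows where training stopped.

# Assumes `model` was fitted with the Monitor from the question.
# estimators_ is trimmed to the stages actually trained, so this will be
# smaller than n_estimators whenever the monitor stopped training early.
n_stages = model.estimators_.shape[0]
print("Trained stages:", n_stages)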

The problem with this 'monitor code' is that the model you end up with includes the 25 extra iterations that only worsened the validation loss. The chosen model should be your No. 3 answer.

I think the easy (and admittedly crude) way to do that is to run the same model again (with a fixed seed, to get the same results) but with the number of iterations set to (i - max_consecutive_decreases).
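
A minimal sketch of that idea, assuming the Monitor class and the variables (params, XTrain, yTrain, XTest, yTest) from the question; random_state is fixed so both fits produce the same sequence of trees:

# 1) Fit once with the monitor to find the stopping iteration.
model = ensemble.GradientBoostingRegressor(random_state=0, **params)
model.fit(XTrain, yTrain, monitor=Monitor(XTest, yTest, 25))

# The last 25 stages only increased the validation loss, so the best
# model is roughly 25 stages before the point where training stopped.
stopped_at = model.estimators_.shape[0]
best_n = max(stopped_at - 25, 1)

# 2) Refit with the same seed, capped at the best iteration.
params_best = dict(params, n_estimators=best_n)
final_model = ensemble.GradientBoostingRegressor(random_state=0,
                                                 **params_best)
final_model.fit(XTrain, yTrain)

This costs a second full fit, but it only uses the public API, so it works without touching scikit-learn's internals.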

Upvotes: 1
