Custom eval metric using early stopping in LGBM (Sklearn API) and Optuna

Question

Questions:

The first question is probably extremely stupid but I will ask anyway: Is the pruning and the early stopping the same in this example below? Or are they two separate options controlling two separate processes?
I got an imbalanced target, so how can I use a custom evaluation metric here instead of 'binary_logloss' such as e.g. balanced accuracy?
When I get the optimal parameters, the 'n_estimators' will still equal 999999. Using an "infinite" number of estimators and pruning using early stopping is recommended for imbalanced targets so that's why it's so high. How do fit the final model with the optimal n_estimators post-pruning?

Thank you very much for helping me out with this I am quite frustrated.

def objective(trial, X, y):
    param_grid = {
        # "device_type": trial.suggest_categorical("device_type", ['gpu']),
        "n_estimators": trial.suggest_categorical("n_estimators", [999999]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step=100),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float(
            "bagging_fraction", 0.2, 0.95, step=0.1
        ),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float(
            "feature_fraction", 0.2, 0.95, step=0.1
        ),
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1121218)

    cv_scores = np.empty(5)
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        model = LGBMClassifier(
            objective="binary",
            **param_grid,
            n_jobs=-1,
            scale_pos_weight=len(y_train) / y_train.sum()
        )
        
        model.fit( 
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            eval_metric="binary_logloss", # replace this with e.g. balanced accuracy or f1
            callbacks=[
                LightGBMPruningCallback(trial, "binary_logloss"), # replace this with e.g. balanced accuracy or f1
                early_stopping(100, verbose=False)
            ], 
        )
        preds = model.predict(X_test)#.argmax(axis=1)
        cv_scores[idx] = balanced_accuracy_score(y_test, preds)
    
    loss = 1 - np.nanmedian(cv_scores)
    return loss

Run:

study = optuna.create_study(direction="minimize", study_name="LGBM Classifier")
func = lambda trial: objective(trial, X_train, y_train)
study.optimize(func, n_trials=1)

Fit the final problem. But here I don't want to fit with n_estimators=999999, but with the optimal number of n_estimators:

model = LGBMClassifier(
    objective="binary",
    **study.best_params,
    n_jobs=-1,
    scale_pos_weight=len(y) / y.sum()
)

Custom eval metric using early stopping in LGBM (Sklearn API) and Optuna

Answers (1)

Related Questions