Reputation: 424
Questions:
Thank you very much for helping me out with this; I am quite frustrated.
import numpy as np
import optuna
from lightgbm import LGBMClassifier, early_stopping
from optuna.integration import LightGBMPruningCallback
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def objective(trial, X, y):
    param_grid = {
        # "device_type": trial.suggest_categorical("device_type", ['gpu']),
        "n_estimators": trial.suggest_categorical("n_estimators", [999999]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step=100),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float(
            "bagging_fraction", 0.2, 0.95, step=0.1
        ),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float(
            "feature_fraction", 0.2, 0.95, step=0.1
        ),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1121218)
    cv_scores = np.empty(5)
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        model = LGBMClassifier(
            objective="binary",
            **param_grid,
            n_jobs=-1,
            scale_pos_weight=len(y_train) / y_train.sum()
        )
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            eval_metric="binary_logloss",  # replace this with e.g. balanced accuracy or f1
            callbacks=[
                LightGBMPruningCallback(trial, "binary_logloss"),  # replace this with e.g. balanced accuracy or f1
                early_stopping(100, verbose=False)
            ],
        )
        preds = model.predict(X_test)  # .argmax(axis=1)
        cv_scores[idx] = balanced_accuracy_score(y_test, preds)
    loss = 1 - np.nanmedian(cv_scores)
    return loss
Run:
study = optuna.create_study(direction="minimize", study_name="LGBM Classifier")
func = lambda trial: objective(trial, X_train, y_train)
study.optimize(func, n_trials=1)
Fit the final model. Here I don't want to fit with n_estimators=999999, but with the optimal number of estimators:
model = LGBMClassifier(
    objective="binary",
    **study.best_params,
    n_jobs=-1,
    scale_pos_weight=len(y) / y.sum()
)
Upvotes: 4
Views: 2416
Reputation: 424
So after a day of experimenting I can answer my own questions:
1. The pruning set up by LightGBMPruningCallback(trial, "your_metric") is NOT the same thing as early stopping. Pruning abandons a whole trial (i.e. a given set of hyperparameters) before all CV folds have been evaluated if the reported metric is very unsatisfactory (e.g. a low balanced accuracy).
2. This one was annoying because the solution is not well documented: set metric='custom' in LGBMClassifier, define the metric as a function, and pass it as eval_metric=your_function (see the code below).
3. There may be a way to retrieve n_estimators for the optimal trial (best params); however, I solved it by fitting the final model with early stopping (see the code below).
CODE
Define a custom metric
def custom_metric(y_true, y_hat):
    higher_is_better = True
    y_hat_label = np.round(y_hat)
    balanced_accuracy = balanced_accuracy_score(y_true, y_hat_label)
    return 'balanced_accuracy', balanced_accuracy, higher_is_better
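To see what the custom metric returns, here is a minimal sketch on made-up labels and probabilities (the arrays are illustrative, not from my data). It assumes y_hat holds positive-class probabilities, and it swaps scikit-learn's balanced_accuracy_score for a small NumPy helper (the mean of the per-class recalls) so that it only needs NumPy. One detail to be aware of: np.round rounds halves to the nearest even number, so a probability of exactly 0.5 is labelled 0.

```python
import numpy as np


def balanced_accuracy(y_true, y_label):
    """Mean of the per-class recalls (hypothetical helper, NumPy only)."""
    recalls = [np.mean(y_label[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


def custom_metric(y_true, y_hat):
    # Same shape as the metric above: (name, value, higher_is_better).
    y_hat_label = np.round(y_hat)  # note: np.round(0.5) == 0.0
    return "balanced_accuracy", balanced_accuracy(y_true, y_hat_label), True


# Made-up example data: three negatives, two positives.
y_true = np.array([0, 0, 0, 1, 1])
y_hat = np.array([0.1, 0.6, 0.2, 0.9, 0.5])

name, value, higher_is_better = custom_metric(y_true, y_hat)
print(name, value, higher_is_better)
```

Here two of three negatives and one of two positives are labelled correctly, so the value is the mean of 2/3 and 1/2.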
Define the objective function (important changes with respect to my question above are commented):
def objective(trial, X, y):
    param_grid = {
        "n_estimators": trial.suggest_categorical("n_estimators", [999999]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step=100),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float(
            "bagging_fraction", 0.2, 0.95, step=0.1
        ),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float(
            "feature_fraction", 0.2, 0.95, step=0.1
        ),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1121218)
    cv_scores = np.empty(5)
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        model = LGBMClassifier(
            metric='custom',  # THIS HAS CHANGED (REF QUESTION 2)!
            objective="binary",
            **param_grid,
            n_jobs=-1,
            scale_pos_weight=len(y_train) / y_train.sum()
        )
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            eval_metric=[custom_metric],  # THIS HAS CHANGED (REF QUESTION 2)!
            callbacks=[
                LightGBMPruningCallback(trial, "balanced_accuracy"),  # THIS HAS CHANGED (REF QUESTION 2)!
                early_stopping(100, verbose=True),
            ],  # Add a pruning callback
        )
        preds = model.predict(X_test)  # .argmax(axis=1)
        cv_scores[idx] = balanced_accuracy_score(y_test, preds)
    score = np.nanmedian(cv_scores)
    return score
The optimization:
study = optuna.create_study(direction="maximize", study_name="LGBM Classifier")
func = lambda trial: objective(trial, X_train, y_train)
study.optimize(func, n_trials=10)
And finally fitting the final model (i.e. answer to question 3). I solved this by using early stopping for the final model:
model = LGBMClassifier(
    objective="binary",
    metric='custom',  # THIS HAS CHANGED (REF QUESTION 2)!
    **study.best_params,
    n_jobs=-1,
    # Computed from the data the model is actually fitted on
    scale_pos_weight=len(y_train) / y_train.sum()
)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    eval_metric=custom_metric,
    # Early stopping goes through the callback only; passing
    # early_stopping_rounds=100 as well is redundant
    callbacks=[
        early_stopping(100, verbose=True),
    ],
)
To summarize, this procedure applies early stopping to each LightGBM model fitted on each fold within each trial (i.e. each combination of hyperparameters).
In addition, it prunes (i.e. stops) trials that give unsatisfactory scores before the model has been fitted on all five folds. Some trials will be stopped very early.
After the search is done, the final model is fitted. The final fit also uses early stopping (note that I use a different evaluation set for the final fit).
And that's it, have a great day :)
Upvotes: 6