Optimal threshold for imbalanced binary classification problem

Question

I have trouble optimizing the decision threshold for binary classification. I am using 3 models: Logistic Regression, CatBoost and sklearn's RandomForestClassifier.

For each model I am doing the following steps:

1) Fit the model.

2) Get 0.0 recall for class 1 (which makes up 5% of the dataset) and 1.0 recall for class 0. (This can't be fixed with grid search or the class_weight='balanced' parameter.) >:(

3) Find the optimal threshold:

import numpy as np
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
optimal_threshold = thresholds[np.argmax(tpr - fpr)]

4) Enjoy ~70% recall for both classes.

5) Predict probabilities for the test dataset and apply the optimal_threshold calculated above to get class labels.
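For reference, the steps above can be sketched end to end roughly as follows. This is only a minimal reproduction, not my real pipeline: the data is synthetic (make_classification with about 5% positives), the model is a plain LogisticRegression, and names such as train_scores, recall_pos and recall_neg are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=4000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: fit the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: pick the threshold that maximizes tpr - fpr on the training data.
train_scores = model.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, train_scores)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]

# Step 5: apply the training-derived threshold to test probabilities.
test_scores = model.predict_proba(X_test)[:, 1]
y_pred = (test_scores >= optimal_threshold).astype(int)

# Step 4 equivalent on the test split: recall for each class.
recall_pos = recall_score(y_test, y_pred, pos_label=1)
recall_neg = recall_score(y_test, y_pred, pos_label=0)
print(optimal_threshold, recall_pos, recall_neg)
```

The quantity tpr - fpr is Youden's J statistic, so the threshold chosen here is the point on the training ROC curve furthest above the diagonal.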

Here comes the question: when I rerun the code several times without fixing random_state, the optimal threshold varies and shifts quite dramatically between runs. This leads to dramatic changes in the accuracy metrics computed on the test sample.
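To make the instability concrete, here is a rough sketch of how the spread could be measured. It assumes the randomness comes from both the train/test split and the model seed (in my real code it may be only one of them), uses synthetic data, and the helper youden_threshold is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)


def youden_threshold(y_true, scores):
    """Return the ROC threshold that maximizes tpr - fpr (hypothetical helper)."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]


thresholds_per_seed = []
for seed in range(5):
    # A different seed changes both the split and the forest itself.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    train_scores = model.predict_proba(X_train)[:, 1]
    thresholds_per_seed.append(youden_threshold(y_train, train_scores))

thresholds_per_seed = np.array(thresholds_per_seed)
# Mean and standard deviation summarize how much the threshold moves.
print(thresholds_per_seed.mean(), thresholds_per_seed.std())
```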

Do I need to calculate some average threshold and use it as a hard-coded constant? Or should I fix random_state everywhere? Or is this method of finding optimal_threshold incorrect?
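The averaging option I have in mind might look something like the sketch below. Note that it differs from my steps above in one respect: each fold's threshold is computed on held-out predictions rather than on the data the model was fit on. The data is synthetic, and fold_thresholds and average_threshold are illustrative names; whether this is a sound approach is part of what I am asking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real training data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=3000, n_features=20, weights=[0.95, 0.05], random_state=0
)

fold_thresholds = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, holdout_idx in cv.split(X, y):
    # Fit on four folds, score the remaining fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[fit_idx], y[fit_idx])
    holdout_scores = model.predict_proba(X[holdout_idx])[:, 1]

    # Same criterion as before (maximize tpr - fpr), but on held-out data.
    fpr, tpr, thresholds = roc_curve(y[holdout_idx], holdout_scores)
    fold_thresholds.append(thresholds[np.argmax(tpr - fpr)])

# Average the per-fold thresholds into one constant to apply at test time.
average_threshold = float(np.mean(fold_thresholds))
print(fold_thresholds, average_threshold)
```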

