Reputation: 693
From a business perspective, false negatives incur roughly tenfold higher costs (real money) than false positives. Given my standard binary classification models (logit, random forest, etc.), how can I incorporate this into my model?
Do I have to change (weight) the loss function in favor of the 'preferred' error (FP)? If so, how do I do that?
Upvotes: 21
Views: 22941
Reputation: 1863
As @Maxim mentioned, there are two stages at which this kind of tuning can happen: the model-training stage (e.g., custom class weights) and the prediction stage (e.g., lowering the decision threshold).
Another option for the model-training stage is using a recall scorer. You can use it in your grid-search cross-validation (GridSearchCV) to tune your classifier's hyper-parameters towards high recall.
GridSearchCV's scoring parameter accepts either the string 'recall' or a scorer built from the recall_score function with make_scorer.
Since you're doing binary classification, both options should work out of the box and call recall_score with its default values, which suit the binary case.
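For instance, the string form needs no setup at all (est and param_grid below stand in for your own estimator and parameter grid, as in the later example):
from sklearn.model_selection import GridSearchCV

# 'recall' resolves to recall_score with its binary-classification defaults
search = GridSearchCV(estimator=est, param_grid=param_grid, scoring='recall')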
Should you need to customize it, you can wrap an existing scorer, or a custom one, with make_scorer and pass it to the scoring parameter.
For example:
from sklearn.metrics import recall_score, make_scorer

# Score recall for the 'yes' class specifically
recall_custom_scorer = make_scorer(
    lambda y, y_pred, **kwargs: recall_score(y, y_pred, pos_label='yes')
)

GridSearchCV(estimator=est, param_grid=param_grid, scoring=recall_custom_scorer, ...)
Upvotes: 7
Reputation: 53766
There are several options for you:
As suggested in the comments, class_weight should boost the loss function towards the preferred class. This option is supported by various estimators, including sklearn.linear_model.LogisticRegression, sklearn.svm.SVC, sklearn.ensemble.RandomForestClassifier, and others. Note there's no theoretical limit to the weight ratio, so even if 1 to 100 isn't strong enough for you, you can go on with 1 to 500, etc.
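A minimal sketch of the class_weight option, assuming the costly-to-miss positive class is labeled 1:
from sklearn.linear_model import LogisticRegression

# Make a false negative weigh ~10x more than a false positive in the loss,
# matching the stated cost ratio
clf = LogisticRegression(class_weight={0: 1, 1: 10})
clf.fit(x_train, y_train)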
You can also select the decision threshold very low during cross-validation to pick the model that gives the highest recall (though possibly low precision). A recall close to 1.0 effectively means false_negatives close to 0.0, which is what you want. For that, use the sklearn.model_selection.cross_val_predict and sklearn.metrics.precision_recall_curve functions:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# Out-of-fold raw scores, so the threshold isn't tuned on training data
y_scores = cross_val_predict(classifier, x_train, y_train, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
If you plot the precisions and recalls against the thresholds, you should see a picture like this:
[plot of precision and recall vs. decision threshold]
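A minimal plotting sketch (assuming matplotlib is available); note that precision_recall_curve returns one more precision/recall value than thresholds:
import matplotlib.pyplot as plt

# precisions and recalls have one extra trailing element, hence the [:-1] slices
plt.plot(thresholds, precisions[:-1], label='precision')
plt.plot(thresholds, recalls[:-1], label='recall')
plt.xlabel('decision threshold')
plt.legend()
plt.show()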
After picking the best threshold, you can use the raw scores from the classifier.decision_function() method for your final classification.
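For instance, a sketch of picking a threshold, assuming a hypothetical target recall of 0.99 and a placeholder test set x_test:
import numpy as np

# recalls is non-increasing as the threshold grows, so take the highest
# threshold that still meets the target recall (0.99 here is an assumption)
target_recall = 0.99
best_threshold = thresholds[recalls[:-1] >= target_recall].max()

# Classify new data with the chosen threshold on the raw decision scores
y_pred = (classifier.decision_function(x_test) >= best_threshold).astype(int)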
Finally, try not to over-optimize your classifier, because you can easily end up with a trivial constant classifier that always predicts the positive class (it never produces a false negative, but it's useless).
Upvotes: 23