Reputation: 21
This question is similar to the one linked below; please read it for reference.
How does sklearn calculate the area under the roc curve for two binary inputs?
I understand that everything happens in sklearn.metrics._binary_clf_curve.
But for binary classification, how are the multiple thresholds calculated or decided in that function? The function returns y_score[threshold_idxs] as the thresholds used to plot roc_curve. I am unable to understand how y_score[threshold_idxs] is calculated and why it serves as the thresholds.
Upvotes: 0
Views: 962
Reputation: 8903
Let's use the scikit-learn 0.22.2 documentation as a compass to understand each component of the function and the final result.
sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
"Active" parameters when using the default call:
y_true: array, shape = [n_samples]. True binary labels.
y_score: array, shape = [n_samples]. Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions.
drop_intermediate: boolean, optional (default=True). Whether to drop some suboptimal thresholds which would not appear on a plotted ROC curve.
Outputs:
fpr: array, shape = [>2]. Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].
tpr: array, shape = [>2]. Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].
thresholds: array, shape = [n_thresholds]. Decreasing thresholds on the decision function used to compute fpr and tpr.
Now, considering the code for roc_curve(), it calls the function _binary_clf_curve(), where, after proper manipulations and sorting, it computes:
distinct_value_indices = np.where(np.diff(y_score))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
The explanation of these lines is in the comment:
y_score typically has many tied values. Here we extract the indices associated with the distinct values. We also concatenate a value for the end of the curve.
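The two lines above roughly answer your question about how the multiple thresholds are calculated. Because y_score has already been sorted in decreasing order at this point, each index in threshold_idxs marks the last sample in a run of tied scores, so y_score[threshold_idxs] is simply the set of distinct score values, and each distinct score is a candidate threshold.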
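To make this concrete, here is a minimal sketch that replays those two lines on a toy input. The labels and scores are made up for illustration, and the sorting step is my paraphrase of what _binary_clf_curve does beforehand, not a copy of the library code.

```python
import numpy as np

# Toy data, assumed purely for illustration.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_score = np.array([0.2, 0.8, 0.5, 0.5, 0.8, 0.1])

# Sort both arrays by decreasing score (stable sort, then reversed).
desc = np.argsort(y_score, kind="mergesort")[::-1]
y_score = y_score[desc]   # [0.8, 0.8, 0.5, 0.5, 0.2, 0.1]
y_true = y_true[desc]     # [1, 1, 0, 1, 0, 0]

# np.diff is non-zero exactly where the score changes, so these are the
# positions of the last sample in each run of tied scores.
distinct_value_indices = np.where(np.diff(y_score))[0]

# Append the final index so the last distinct score is included too.
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

thresholds = y_score[threshold_idxs]
print(threshold_idxs)  # [1 3 4 5]
print(thresholds)      # [0.8 0.5 0.2 0.1]
```

Six samples with four distinct scores give four thresholds, one per distinct score value.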
Then, it computes:
tps = stable_cumsum(y_true * weight)[threshold_idxs]
fps = 1 + threshold_idxs - tps
and returns:
return fps, tps, y_score[threshold_idxs]
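As a rough illustration of the counting step, the sketch below applies the same formulas to a small, already sorted toy input. The data is assumed, weight is taken to be 1 (no sample_weight), and np.cumsum is used as a stand-in for sklearn's stable_cumsum helper.

```python
import numpy as np

# Toy input, assumed for illustration and already sorted by decreasing score.
y_true = np.array([1, 1, 0, 1, 0, 0])
y_score = np.array([0.8, 0.8, 0.5, 0.5, 0.2, 0.1])
weight = 1.0  # assumption: no sample_weight was passed

threshold_idxs = np.r_[np.where(np.diff(y_score))[0], y_true.size - 1]

# Cumulative count of positives, read off at each threshold index:
# the number of true positives when predicting "positive" for score >= threshold.
tps = np.cumsum(y_true * weight)[threshold_idxs]

# (1 + index) samples have score >= threshold; those that are not
# true positives must be false positives.
fps = 1 + threshold_idxs - tps

print(tps)  # [2. 3. 3. 3.]
print(fps)  # [0. 1. 2. 3.]
```

For example, at threshold 0.5 four samples score at least 0.5; three of them are positive and one is negative, which matches tps[1] = 3 and fps[1] = 1.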
After that, back in the main function roc_curve(), if drop_intermediate and len(fps) > 2 holds, it attempts to drop thresholds corresponding to points in between and collinear with other points:
optimal_idxs = np.where(np.r_[True,
                              np.logical_or(np.diff(fps, 2),
                                            np.diff(tps, 2)),
                              True])[0]
and the "new" values are:
fps = fps[optimal_idxs]
tps = tps[optimal_idxs]
thresholds = thresholds[optimal_idxs]
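Here is a small sketch of what that filter does, using assumed toy counts for fps, tps and thresholds. A point is kept when the second difference of fps or tps is non-zero there, i.e. when the curve bends at that point; the first and last points are always kept.

```python
import numpy as np

# Toy values, assumed for illustration (counts per threshold, not rates).
fps = np.array([0.0, 1.0, 2.0, 3.0])
tps = np.array([2.0, 3.0, 3.0, 3.0])
thresholds = np.array([0.8, 0.5, 0.2, 0.1])

# Second differences are zero where three consecutive points are collinear.
optimal_idxs = np.where(np.r_[True,
                              np.logical_or(np.diff(fps, 2),
                                            np.diff(tps, 2)),
                              True])[0]

fps = fps[optimal_idxs]
tps = tps[optimal_idxs]
thresholds = thresholds[optimal_idxs]

print(optimal_idxs)  # [0 1 3]
print(thresholds)    # [0.8 0.5 0.1]
```

The point (fps=2, tps=3) lies on the straight segment between (1, 3) and (3, 3), so its threshold 0.2 is dropped without changing the shape of the plotted curve.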
After that there are a few other manipulations, but the core is what I have highlighted above.
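For completeness, here is a hedged sketch of those remaining steps as I read them in the 0.22.x source: an extra point is prepended so the curve starts at (0, 0), and the counts are divided by their totals to get rates. The input values are assumed toy data, and other versions of scikit-learn may choose the first threshold differently.

```python
import numpy as np

# Toy counts, assumed for illustration (e.g. after dropping collinear points).
fps = np.array([0.0, 1.0, 3.0])
tps = np.array([2.0, 3.0, 3.0])
thresholds = np.array([0.8, 0.5, 0.1])

# Prepend a point where nothing is predicted positive, so the curve
# starts at (0, 0). Assumption: the first threshold is max score + 1,
# as in the 0.22.x behaviour.
tps = np.r_[0, tps]
fps = np.r_[0, fps]
thresholds = np.r_[thresholds[0] + 1, thresholds]

# Turn counts into rates by dividing by the total negatives / positives.
fpr = fps / fps[-1]
tpr = tps / tps[-1]

print(thresholds)
print(fpr)
print(tpr)
```

This is also why the documented shape of fpr and tpr is [>2]: the returned arrays always contain the added starting point in addition to the real thresholds.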
Upvotes: 3