How is the number of thresholds defined in sklearn's roc_curve function?

Question

When I used sklearn's roc_curve function on my data with a logistic regression model:

roc_curve(y_test, predictions_test)

I got this result:

(array([0. , 0.1, 1. ]), array([0.   , 0.865, 1.   ]), array([2, 1, 0]))

I know that the third array holds the thresholds, and that the first and second hold the corresponding FPR and TPR. But I don't understand why there are three thresholds. How is the number of thresholds determined in this function? For example, when I use logistic regression, the thresholds should be probabilities from the sigmoid function, but here they are 2, 1, 0. Why is that?

amiola · Accepted Answer

As you can see from the source code (in _binary_clf_curve(), which roc_curve() calls), the number of thresholds is determined by the number of distinct values in predictions_test (scores, in principle). From your output, however, I suspect that predictions_test is the output of .predict() rather than of .predict_proba() or .decision_function(), which is what roc_curve requires. (If this is a multiclass classification problem, you will also need to extend the ROC curve definition to the multiclass setting.)
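To illustrate the difference, here is a minimal sketch on synthetic data. The dataset from make_classification and all variable names are my assumptions, not your setup. It passes hard labels from .predict() and then positive-class probabilities from .predict_proba() to roc_curve and compares how many thresholds come back. As far as I know, the first threshold is an extra value that roc_curve prepends so the curve starts at (0, 0): max(score) + 1 in older releases (which would explain the 2 in your output) and inf in more recent ones.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem (an assumption standing in for the asker's data).
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard class labels: only two distinct values, 0 and 1.
hard_labels = model.predict(X_test)
# Probability of the positive class: many distinct values in [0, 1].
probabilities = model.predict_proba(X_test)[:, 1]

_, _, thr_labels = roc_curve(y_test, hard_labels)
_, _, thr_probs = roc_curve(y_test, probabilities)

print("thresholds from .predict():      ", thr_labels)
print("threshold count from .predict_proba():", len(thr_probs))
```

With hard labels there are only two distinct scores, so roc_curve returns those two values plus the prepended one. With probabilities the thresholds are actual sigmoid outputs, and there are many more of them.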

Moreover, be aware that roc_curve also has a parameter drop_intermediate (True by default) which, in some cases, drops suboptimal thresholds, i.e. those that would not appear on a plotted ROC curve.
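The effect of drop_intermediate can be seen on a small hand-made example. The labels and scores below are made up for illustration. With drop_intermediate=False, roc_curve returns one threshold per distinct score plus the prepended one; with the default it returns fewer, because points that lie on a straight segment of the curve are removed.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores; every score is distinct.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

# Keep every threshold: one per distinct score, plus the prepended one.
_, _, thr_all = roc_curve(y_true, scores, drop_intermediate=False)
# Default behaviour: thresholds that do not change the curve's shape are dropped.
_, _, thr_kept = roc_curve(y_true, scores, drop_intermediate=True)

print("drop_intermediate=False:", len(thr_all), "thresholds")
print("drop_intermediate=True: ", len(thr_kept), "thresholds")
```

Both calls describe the same ROC curve, so the AUC is unaffected; only the number of returned points differs.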

Finally, I'd suggest the following posts:

"Plotting the ROC curve for a multiclass problem", for the ROC curve extension to a multiclass setting;
"sklearn.metrics.roc_curve only shows 5 fprs, tprs, thresholds" or "sklearn's roc_curve() function returns thresholds and fpr of different dimensions", for a better understanding of the implications of the parameter drop_intermediate=True.
