Arnold

Reputation: 153

Thresholds, False Positive Rate, True Positive Rate

I am trying to get a clear understanding of what goes into the calculations of the terms in the title. The documentation at https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics says

"A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings."

Here is some simple code I wrote using predictions I made with Keras.

import numpy as np
from sklearn import metrics
test1 = '0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0'        
pred1 = '0.04172871 0.01611879 0.01073375 0.03344169 0.04172871 0.04172871\
 0.00430162 0.04172871 0.04172871 0.04172871 0.07977659 0.905772\
 0.9396076  0.03344169 0.04172871 0.09125287 0.02964183 0.0641269\
 0.04172871 0.04172871 0.04172871 0.0641269  0.04172871 0.04172871\
 0.9919831  0.04172871 0.01611879 0.04172871 0.37865442 0.00240888'

test = np.array([int(i) for i in test1.split()])
pred = np.array([float(i) for i in pred1.split()])

print(type(test))

print(type(pred))

fpr, tpr, thresholds = metrics.roc_curve(test, pred)
print('false pos rate')
print(fpr)
print('true pos rate')
print(tpr)
print('thresholds')
print(thresholds)

I can see how it has selected the thresholds (in this case 10 values, running from the highest pred + 1 down to the lowest pred), but why 10 threshold values in this case and not some other number? I'd also like to be able to follow the algebra behind the fpr and tpr values for each threshold. The answer is probably in the documentation sentence quoted above, but I have not got my head around how the rate calculation works.

Here are, respectively, the thresholds, the false positive rates, and the true positive rates:

[1.9919831  0.9919831  0.37865442 0.07977659 0.0641269  0.04172871
 0.03344169 0.02964183 0.01611879 0.00240888]
[0.         0.         0.         0.07692308 0.15384615 0.69230769
 0.76923077 0.80769231 0.88461538 1.        ]
[0.   0.25 1.   1.   1.   1.   1.   1.   1.   1.  ]

Upvotes: 1

Views: 2628

Answers (1)

amiola

Reputation: 3036

In roc_curve, the threshold vector is built from the distinct scores (in your case, the distinct values in your pred array).

That said, roc_curve has a further parameter, drop_intermediate (default True), which drops suboptimal thresholds. If you set it to False so that no thresholds are dropped, i.e. fpr, tpr, thresholds = metrics.roc_curve(test, pred, drop_intermediate=False), you'll get a threshold vector whose length equals the number of distinct values in your scores vector, plus one.

The "plus one" comes from an extra first element, thresholds[0] + 1, which is prepended so that the ROC curve starts at (0, 0). That is the 1.9919831 in your output (recent scikit-learn releases use np.inf for this first element instead).
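To see where the 10 thresholds in the question come from, here is a minimal pure-Python sketch. It assumes that drop_intermediate=True keeps only the thresholds where the curve changes direction (a nonzero second difference in the cumulative true positive or false positive counts) plus the two end points. That rule is my reading of the behaviour, not the library's code, and the names distinct, tps, fps, second_diff and kept are made up for illustration.

```python
# Data copied from the question.
test = [int(v) for v in
        "0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0".split()]
pred = [float(v) for v in """
0.04172871 0.01611879 0.01073375 0.03344169 0.04172871 0.04172871
0.00430162 0.04172871 0.04172871 0.04172871 0.07977659 0.905772
0.9396076  0.03344169 0.04172871 0.09125287 0.02964183 0.0641269
0.04172871 0.04172871 0.04172871 0.0641269  0.04172871 0.04172871
0.9919831  0.04172871 0.01611879 0.04172871 0.37865442 0.00240888
""".split()]

# Candidate thresholds: the distinct scores, highest first.
distinct = sorted(set(pred), reverse=True)

# Cumulative counts of positives / negatives with score >= each threshold.
tps = [sum(1 for y, s in zip(test, pred) if s >= t and y == 1) for t in distinct]
fps = [sum(1 for y, s in zip(test, pred) if s >= t and y == 0) for t in distinct]


def second_diff(values, i):
    """Discrete second difference at index i (zero on a straight segment)."""
    return values[i + 1] - 2 * values[i] + values[i - 1]


# Assumed dropping rule: keep both ends and every point where the curve bends.
keep = [0]
keep += [i for i in range(1, len(distinct) - 1)
         if second_diff(tps, i) != 0 or second_diff(fps, i) != 0]
keep.append(len(distinct) - 1)
kept = [distinct[i] for i in keep]

print(len(distinct))  # 14 distinct scores, so 15 thresholds with drop_intermediate=False
print(len(kept) + 1)  # 10 thresholds once the extra first element is counted
print(kept)
```

Under that assumption, the five dropped scores (0.9396076, 0.905772, 0.09125287, 0.01073375 and 0.00430162) all lie on straight segments of the curve, so removing them does not change its shape.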

As for the computation of tpr (tp / (tp + fn)), consider what the documentation says:

tpr: ndarray of shape (>2,) Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].

To decode this, it might be better to stick to an easier example:

import numpy as np
from sklearn import metrics
y = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, drop_intermediate=False)   
# thresholds = array([1.8, 0.8, 0.4, 0.35, 0.1])

Sorting scores in descending order (the same order as thresholds) and reordering y accordingly, you'd have:

y = np.array([1, 0, 1, 0])
scores = np.array([0.8, 0.4, 0.35, 0.1])
thresholds = np.array([1.8, 0.8, 0.4, 0.35, 0.1])

# threshold = 1.8, no score >= 1.8          --> tp = 0, tp+fn = total number of positives = 2 --> tpr = 0
# threshold = 0.8, adds score 0.8 (y = 1)   --> tp = 1, tp+fn = 2 --> tpr = 0.5
# threshold = 0.4, adds score 0.4 (y = 0)   --> tp = 1, tp+fn = 2 --> tpr = 0.5
# threshold = 0.35, adds score 0.35 (y = 1) --> tp = 2, tp+fn = 2 --> tpr = 1
# threshold = 0.1, adds score 0.1 (y = 0)   --> tp = 2, tp+fn = 2 --> tpr = 1

tp being the cumulative number of true positives, i.e. the positives among the predictions with score >= threshold.

Therefore, you'd have tpr = np.array([0, 0.5, 0.5, 1, 1]). The computation of fpr follows the same way from its definition, fp / (fp + tn): fp is the cumulative number of negatives with score >= threshold and fp + tn = 2 is the total number of negatives, which gives fpr = np.array([0, 0, 0.5, 0.5, 1]).
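The same arithmetic can be checked by applying the definitions directly. This is a minimal sketch in plain Python: roc_points is a hypothetical helper written for illustration, not part of scikit-learn. It assumes no thresholds are dropped and follows the "highest score + 1" convention for the first threshold.

```python
def roc_points(y_true, scores):
    """Return (thresholds, fpr, tpr) for every distinct score, highest first.

    Illustrative helper only: it counts, for each threshold t, the positives
    and negatives whose score is >= t and divides by the class totals.
    """
    n_pos = sum(y_true)               # tp + fn
    n_neg = len(y_true) - n_pos       # fp + tn
    distinct = sorted(set(scores), reverse=True)
    thresholds = [distinct[0] + 1] + distinct  # extra first point gives (0, 0)
    fpr, tpr = [], []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return thresholds, fpr, tpr


y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
thresholds, fpr, tpr = roc_points(y, scores)
print(tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
print(fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
```

Each (fpr[i], tpr[i]) pair is one point on the ROC curve, read from the strictest threshold (nothing predicted positive) to the loosest (everything predicted positive).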

Upvotes: 3
