Reputation: 724
I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this here: How does sklearn select threshold steps in precision recall curve?. It points to the source code, where I found this example:
import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
which then gives
>>> precision
array([0.66666667, 0.5 , 1. , 1. ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])
Could someone explain to me how to get those recalls and precisions by showing me what is computed?
Upvotes: 2
Views: 6424
Reputation: 3026
I know I am a bit late here, but I had a similar doubt that the link you provided cleared up. Roughly speaking, here is what happens inside precision_recall_curve(), following the sklearn implementation.
Decision scores are sorted in descending order, and the labels are reordered accordingly:
desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
y_scores = y_scores[desc_score_indices]
y_true = y_true[desc_score_indices]
You'll get:
y_scores, y_true
(array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))
The sklearn implementation then excludes duplicated values of y_scores (there are no duplicates in this example):
distinct_value_indices = np.where(np.diff(y_scores))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
Due to the absence of duplicates you'll get:
distinct_value_indices, threshold_idxs
(array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
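To see what this step actually does when duplicates are present, here is a small sketch with made-up scores (not the ones from your question) where 0.4 appears twice. np.diff is zero wherever consecutive sorted scores are equal, so only the last index of each run of equal scores is kept:

```python
import numpy as np

# Scores already sorted in descending order, with 0.4 duplicated
y_scores = np.array([0.8, 0.4, 0.4, 0.1])
y_true = np.array([1, 0, 1, 0])

# np.diff(y_scores) = [-0.4, 0, -0.3]; the zero at index 1 drops the first 0.4
distinct_value_indices = np.where(np.diff(y_scores))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

print(distinct_value_indices)  # [0 2]
print(threshold_idxs)          # [0 2 3]
```

Keeping the last index of each run matters because the cumulative sums below must count every sample tied at that score.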
Finally, you can compute the number of true positives and false positives, from which precision and recall follow.
# tps at index i being the number of positive samples assigned a score >= thresholds[i]
tps = np.cumsum(y_true)[threshold_idxs]
# fps at index i being the number of negative samples assigned a score >= thresholds[i], sklearn computes it as fps = 1 + threshold_idxs - tps
fps = np.cumsum(1 - y_true)[threshold_idxs]
y_scores = y_scores[threshold_idxs]
After these steps you'll have two arrays with the number of true positives and false positives per considered score.
tps, fps
(array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
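You can sanity-check these cumulative counts by brute force: for each distinct score, count the positives and negatives whose score is at least that high. A quick sketch using your example data:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

counts = []
for t in sorted(set(y_scores), reverse=True):
    # tp/fp at threshold t: positives/negatives scored >= t
    tp = int(np.sum((y_scores >= t) & (y_true == 1)))
    fp = int(np.sum((y_scores >= t) & (y_true == 0)))
    counts.append((t, tp, fp))

print(counts)  # [(0.8, 1, 0), (0.4, 1, 1), (0.35, 2, 1), (0.1, 2, 2)]
```

The tp and fp columns match tps = [1, 1, 2, 2] and fps = [0, 1, 1, 2] above (listed here from highest threshold to lowest, i.e. in the same descending-score order).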
Eventually, you can compute precision and recall.
precision = tps / (tps + fps)
# tps[-1] being the total number of positive samples
recall = tps / tps[-1]
precision, recall
(array([1. , 0.5 , 0.66666667, 0.5 ]), array([0.5, 0.5, 1. , 1. ]))
An important point, which causes the thresholds array to be shorter than y_scores (even though there are no duplicates in y_scores), is the one pointed out in the link you referenced: the index of the first occurrence of recall equal to 1 determines the length of the thresholds array (index 2 here, hence length 3). The reason is that once you reach full recall, lowering the threshold further can only introduce unnecessary false positives (in other words, no further true positives), so recall stays at 1 by definition.
last_ind = tps.searchsorted(tps[-1]) # 2
sl = slice(last_ind, None, -1) # from index 2 to 0
precision, recall, thresholds = np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[sl]
(array([0.66666667, 0.5 , 1. , 1. ]),
array([1. , 0.5, 0.5, 0. ]), array([0.35, 0.4 , 0.8 ]))
Last point: precision and recall have length 4 because a precision value of 1 and a recall value of 0 are appended to the arrays, so that the precision-recall curve starts on the y-axis.
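Putting all of the steps above into one function (manual_pr_curve is just a name I made up for this sketch), you can reproduce the exact arrays from your question:

```python
import numpy as np

def manual_pr_curve(y_true, y_scores):
    # 1. Sort by descending score, reorder labels accordingly
    order = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores, y_true = y_scores[order], y_true[order]
    # 2. Keep only the last index of each run of equal scores
    distinct = np.where(np.diff(y_scores))[0]
    threshold_idxs = np.r_[distinct, y_true.size - 1]
    # 3. Cumulative true/false positive counts at each kept score
    tps = np.cumsum(y_true)[threshold_idxs]
    fps = np.cumsum(1 - y_true)[threshold_idxs]
    thresholds = y_scores[threshold_idxs]
    precision = tps / (tps + fps)
    recall = tps / tps[-1]
    # 4. Cut at the first full-recall index, reverse, append the (1, 0) point
    sl = slice(tps.searchsorted(tps[-1]), None, -1)
    return np.r_[precision[sl], 1], np.r_[recall[sl], 0], thresholds[sl]

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
p, r, t = manual_pr_curve(y_true, y_scores)
print(p)  # [0.66666667 0.5 1. 1.]
print(r)  # [1.  0.5 0.5 0. ]
print(t)  # [0.35 0.4  0.8 ]
```

These are the same arrays precision_recall_curve returns in your question, so the walkthrough above covers the whole computation.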
Upvotes: 6