Reputation: 724
I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this here: How does sklearn select threshold steps in precision recall curve?. It points to the source code, where I found this example:
import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
which then gives
>>> precision
array([0.66666667, 0.5 , 1. , 1. ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])
Could someone explain to me how to get those recalls and precisions by showing me what is computed?
Upvotes: 2
Views: 6424
Reputation: 3026
I know I am a bit late here, but I had a similar doubt that the link you provided cleared up. Roughly speaking, here is what happens inside precision_recall_curve(), following the sklearn implementation.
Decision scores are sorted in descending order, and the labels are reordered accordingly:
desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
y_scores = y_scores[desc_score_indices]
y_true = y_true[desc_score_indices]
You'll get:
y_scores, y_true
(array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))
The sklearn implementation then excludes duplicated values of y_scores (there are no duplicates in this example):
distinct_value_indices = np.where(np.diff(y_scores))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
Due to the absence of duplicates you'll get:
distinct_value_indices, threshold_idxs
(array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
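To see what this step actually does when duplicates are present, here is a small sketch with made-up scores (not the ones from your question) where 0.4 appears twice. np.diff is zero wherever consecutive sorted scores are equal, so only the last index of each run of equal scores is kept:

```python
import numpy as np

# Scores already sorted in descending order, with 0.4 duplicated
y_scores = np.array([0.8, 0.4, 0.4, 0.1])
y_true = np.array([1, 0, 1, 0])

# np.diff(y_scores) = [-0.4, 0, -0.3]; the zero at index 1 drops the first 0.4
distinct_value_indices = np.where(np.diff(y_scores))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

print(distinct_value_indices)  # [0 2]
print(threshold_idxs)          # [0 2 3]
```

Keeping the last index of each run matters because the cumulative sums below must count every sample tied at that score.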
Finally, you can compute the number of true positives and false positives, from which precision and recall follow.
# tps at index i being the number of positive samples assigned a score >= thresholds[i]
tps = np.cumsum(y_true)[threshold_idxs]
# fps at index i being the number of negative samples assigned a score >= thresholds[i], sklearn computes it as fps = 1 + threshold_idxs - tps
fps = np.cumsum(1 - y_true)[threshold_idxs]
y_scores = y_scores[threshold_idxs]
After these steps you'll have two arrays with the number of true positives and false positives per considered score.
tps, fps
(array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
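You can sanity-check these cumulative counts by brute force: for each distinct score, count the positives and negatives whose score is at least that high. A quick sketch using your example data:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

counts = []
for t in sorted(set(y_scores), reverse=True):
    # tp/fp at threshold t: positives/negatives scored >= t
    tp = int(np.sum((y_scores >= t) & (y_true == 1)))
    fp = int(np.sum((y_scores >= t) & (y_true == 0)))
    counts.append((t, tp, fp))

print(counts)  # [(0.8, 1, 0), (0.4, 1, 1), (0.35, 2, 1), (0.1, 2, 2)]
```

The tp and fp columns match tps = [1, 1, 2, 2] and fps = [0, 1, 1, 2] above (listed here from highest threshold to lowest, i.e. in the same descending-score order).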
Eventually, you can compute precision and recall.
precision = tps / (tps + fps)
# tps[-1] being the total number of positive samples
recall = tps / tps[-1]
precision, recall
(array([1. , 0.5 , 0.66666667, 0.5 ]), array([0.5, 0.5, 1. , 1. ]))
An important point, which causes the thresholds array to be shorter than y_scores (even though there are no duplicates in y_scores), is the one pointed out in the link you referenced: the index of the first occurrence of recall equal to 1 determines the length of the thresholds array (index 2 here, hence length 3). The reason is that once you reach full recall, lowering the threshold further can only introduce unnecessary false positives (in other words, no further true positives), so recall stays at 1 by definition.
last_ind = tps.searchsorted(tps[-1]) # 2
sl = slice(last_ind, None, -1) # from index 2 to 0
precision, recall, thresholds = np.r_[precision[sl], 1], np.r_[recall[sl], 0], y_scores[sl]
(array([0.66666667, 0.5 , 1. , 1. ]),
array([1. , 0.5, 0.5, 0. ]), array([0.35, 0.4 , 0.8 ]))
Last point: precision and recall have length 4 because a precision value of 1 and a recall value of 0 are appended to the arrays, so that the precision-recall curve starts on the y-axis.
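Putting all of the steps above into one function (manual_pr_curve is just a name I made up for this sketch), you can reproduce the exact arrays from your question:

```python
import numpy as np

def manual_pr_curve(y_true, y_scores):
    # 1. Sort by descending score, reorder labels accordingly
    order = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores, y_true = y_scores[order], y_true[order]
    # 2. Keep only the last index of each run of equal scores
    distinct = np.where(np.diff(y_scores))[0]
    threshold_idxs = np.r_[distinct, y_true.size - 1]
    # 3. Cumulative true/false positive counts at each kept score
    tps = np.cumsum(y_true)[threshold_idxs]
    fps = np.cumsum(1 - y_true)[threshold_idxs]
    thresholds = y_scores[threshold_idxs]
    precision = tps / (tps + fps)
    recall = tps / tps[-1]
    # 4. Cut at the first full-recall index, reverse, append the (1, 0) point
    sl = slice(tps.searchsorted(tps[-1]), None, -1)
    return np.r_[precision[sl], 1], np.r_[recall[sl], 0], thresholds[sl]

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
p, r, t = manual_pr_curve(y_true, y_scores)
print(p)  # [0.66666667 0.5 1. 1.]
print(r)  # [1.  0.5 0.5 0. ]
print(t)  # [0.35 0.4  0.8 ]
```

These are the same arrays precision_recall_curve returns in your question, so the walkthrough above covers the whole computation.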
Upvotes: 6