Reputation: 814
I have the following code that calculates a precision-recall curve for an object detection task. Detections are matched to ground truth by forming 1-to-1 pairs, starting from the detection with the highest confidence score and pairing it with the ground-truth object with which it has the highest overlap. The results are stored in the detection_matches vector, in which the value is True if the detection was matched to some ground-truth object and False otherwise. This PR curve is then used to calculate the Average Precision score.
import numpy as np


def precision_recall_curve(
    detection_matches: np.ndarray, detection_scores: np.ndarray, total_ground_truths: int
):
    sorted_detection_indices = np.argsort(detection_scores, kind="stable")[::-1]
    detection_scores = detection_scores[sorted_detection_indices]
    detection_matches = detection_matches[sorted_detection_indices]
    threshold_indices = np.r_[np.where(np.diff(detection_scores))[0], detection_matches.size - 1]
    confidence_thresholds = detection_scores[threshold_indices]
    true_positives = np.cumsum(detection_matches)[threshold_indices]
    false_positives = np.cumsum(~detection_matches)[threshold_indices]
    precision = true_positives / (true_positives + false_positives)
    precision[np.isnan(precision)] = 0
    recall = true_positives / total_ground_truths
    full_recall_idx = true_positives.searchsorted(true_positives[-1])
    reversed_slice = slice(full_recall_idx, None, -1)
    return np.r_[precision[reversed_slice], 1], np.r_[recall[reversed_slice], 0]


def ap_score(precision, recall):
    return -np.sum(np.diff(recall) * np.array(precision)[:-1])
This can be used to calculate AP-score for the example vectors:
detection_matches = np.array([True, True, True, True, True, True, False, True])
detection_scores = np.array([0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55])
total_ground_truths = 10
precision, recall = precision_recall_curve(detection_matches, detection_scores, total_ground_truths)
# (array([0.875 , 0.85714286, 1. , 1. , 1. ,
# 1. , 1. , 1. , 1. ]),
# array([0.7, 0.6, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0. ]))
ap_score(precision, recall)
# 0.6875
However, adding more detections, even ones with super-low confidence, increases the AP score, which doesn't seem correct.
detection_matches = np.array([True, True, True, True, True, True, False, True, True, False, False, False, False, False, False])
detection_scores = np.array([0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.04, 0.03, 0.02, 0.015, 0.012, 0.011, 0.01])
total_ground_truths = 10
precision, recall = precision_recall_curve(detection_matches, detection_scores, total_ground_truths)
# (array([0.88888889, 0.875 , 0.85714286, 1. , 1. ,
# 1. , 1. , 1. , 1. , 1. ]),
# array([0.8, 0.7, 0.6, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0. ]))
ap_score(precision, recall)
# 0.7763888888888889
I can see this is because the low precision values in the precision vector (array([1., 1., 1., 1., 1., 1., 0.85714286, 0.875, 0.88888889, 0.8, 0.72727273, 0.66666667, 0.61538462, 0.57142857, 0.53333333])) are effectively ignored, since both precision and recall are trimmed at the index where recall reaches its maximum value. However, even without trimming, recall is constant over that tail, so the difference in recall is 0 and the low precision values are not taken into account anyway.
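As a quick illustration of that point, here is a small sketch over the second example's arrays, with the trimming step left out (the arrays are already sorted by descending score):

tp = np.cumsum(detection_matches)             # saturates at 8 once all matched detections are counted
fp = np.cumsum(~detection_matches)
untrimmed_precision = tp / (tp + fp)          # the vector quoted above
untrimmed_recall = tp / total_ground_truths   # flat at 0.8 over the low-confidence tail
# np.diff(untrimmed_recall) is 0 wherever recall is flat, so those "rectangles"
# have zero width and the low precision values drop out of the sum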
Is there a bug in this implementation? If so, what should be adjusted to make low precision scores impact (negatively) the AP score? Or is it a case where the AP score just doesn't work intuitively?
Upvotes: 3
Views: 753
Reputation: 2701
I think something weird is happening in the function you've written for the precision-recall curve. I've compared your curve with sklearn.metrics.precision_recall_curve and this is the result:
detection_matches = np.array([True, True, True, True, True, True, False, True])
detection_scores = np.array([0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55])
total_ground_truths = 10
detection_matches = np.array([True, True, True, True, True, True, False, True, True, False, False, False, False, False, False])
detection_scores = np.array([0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.04, 0.03, 0.02, 0.015, 0.012, 0.011, 0.01])
total_ground_truths = 10
Is this the desired behavior? Do you expect the curve to differ from the original? Does total_ground_truths play a crucial role that I cannot see?
In any case I agree with @mrk about the oversimplification, and I've rewritten the functions to simplify them:
def my_precision_recall_curve(
    detection_matches: np.ndarray, detection_scores: np.ndarray, total_ground_truths: int
):
    sorted_detection_indices = np.argsort(detection_scores, kind="stable")[::-1]
    detection_scores = detection_scores[sorted_detection_indices]
    detection_matches = detection_matches[sorted_detection_indices]
    # positives[i, j] is True if detection j is kept at the threshold given by score i
    positives = detection_scores >= detection_scores[:, None]
    negatives = ~positives
    true_positives = positives[:, detection_matches].sum(axis=1)
    false_positives = positives[:, ~detection_matches].sum(axis=1)
    false_negatives = negatives[:, detection_matches].sum(axis=1)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return np.concatenate([[1], precision]), np.concatenate([[0], recall])


def my_ap_score(precision, recall):
    return (np.diff(recall) * np.array(precision)[:-1]).sum()
My curves and scores overlap with the sklearn
ones for both examples you've made:
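A minimal sketch of such a comparison (assuming matplotlib for the overlay plot, and using the first example's arrays) could look like this:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve as sk_pr_curve

detection_matches = np.array([True, True, True, True, True, True, False, True])
detection_scores = np.array([0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55])

# sklearn derives the number of positives from y_true itself, i.e. it effectively
# uses total_ground_truths == detection_matches.sum()
sk_precision, sk_recall, _ = sk_pr_curve(detection_matches, detection_scores)
my_precision, my_recall = my_precision_recall_curve(
    detection_matches, detection_scores, total_ground_truths=detection_matches.sum()
)

plt.plot(sk_recall, sk_precision, label="sklearn")
plt.plot(my_recall, my_precision, "--", label="my_precision_recall_curve")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.show()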
BTW I think that setting total_ground_truths == detection_matches.sum() in your function solves the discrepancy.
Upvotes: -2
Reputation: 10396
- Is there a bug in the implementation?
No, but maybe there are some oversimplifications.
I think the first thing to understand is that the Average Precision (AP) score is influenced by both precision and recall. Yep, the name is a little misleading.
But what you are actually calculating here says it all:
def ap_score(precision, recall):
    return -np.sum(np.diff(recall) * np.array(precision)[:-1])
You are basically calculating the area under your precision-over-recall curve: the diff in recall values approximates the width (0.1) of each "rectangle", which gets multiplied by the corresponding precision value as its height.
However, where your intuition is compromised can be seen when you check the number of entries; this is where the magic happens.
As you are averaging by simply summing over the "rectangles" (relying on the width between recall values being 0.1), every rectangle counts. And since the second object detector configuration has an additional rectangle due to your confidence cut-offs, this pays off big time.
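To make this concrete, here is the rectangle view of your first example (just a sketch, using the precision/recall arrays returned by your function above):

import numpy as np

precision = np.array([0.875, 0.85714286, 1., 1., 1., 1., 1., 1., 1.])
recall = np.array([0.7, 0.6, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.])

widths = -np.diff(recall)        # mostly 0.1; 0.0 for the duplicated recall value 0.6
heights = precision[:-1]         # one precision value per rectangle
print(widths * heights)          # the individual rectangle areas
print(np.sum(widths * heights))  # 0.6875, i.e. your ap_score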
- What should be adjusted to make low precision scores impact (negatively) the AP score?
The hiccups you are observing might well vanish with more detections.
You are sanity-checking this code with very little data. Usually this curve relies on at least hundreds of detections to get a proper idea of the performance of your object detector. The issues with having multiple or no entries for specific recall values might then also resolve themselves.
In case you want a metric that penalizes low precision more, just write a metric for that. For example, you could simply start by looking at the average of the precision values:
avg_precision = np.mean(precision)
# First example: 0.97
# Second example: 0.95
Looks more like what you expected to see, I guess. When analyzing and evaluating an object detector (or any system for that matter), there won't be that one metric you can go to which will inform you about every characteristic of the system.
I hope I could help. Cheers.
Note: Observe that your code does not deal with duplicate recall values (e.g. 0.6), which are just computed as separate "rectangles". This is often done in these kinds of approximations; how you want to deal with it is up to you. An easy solution would be to average the precision values for that "rectangle" (see the sketch below).
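A minimal sketch of that idea (the helper name and the averaging choice are mine, not part of your code):

import numpy as np

def average_duplicate_recalls(precision, recall):
    # Hypothetical helper: collapse entries that share the same recall value by
    # averaging their precision, so every "rectangle" appears only once.
    unique_recalls = np.unique(recall)[::-1]  # keep the decreasing order of the curve
    averaged_precision = np.array(
        [precision[recall == r].mean() for r in unique_recalls]
    )
    return averaged_precision, unique_recalls

# e.g. in your first example the two entries at recall == 0.6
# (precision 0.85714286 and 1.0) would be averaged into a single value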
Upvotes: 1