Reputation: 87
I have a dataset I use to test my object detection model, let's say test_dataset.
When evaluating a given model with COCO eval (through the YOLOX eval.py script), I get this result:
Average forward time: 23.05 ms, Average NMS time: 2.60 ms, Average inference time: 25.65 ms
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.724
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.957
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.831
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.591
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.810
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.535
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.755
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.759
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.349
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.649
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.839
per class AP:
| class | AP | class | AP | class | AP |
|:-------------|:-------|:--------|:-------|:-------------|:-------|
| cargo | 59.491 | ferry | 87.701 | fishing boat | 67.328 |
| sailing boat | 75.134 | | | | |
per class AR:
| class | AR | class | AR | class | AR |
|:-------------|:-------|:--------|:-------|:-------------|:-------|
| cargo | 64.802 | ferry | 89.717 | fishing boat | 71.506 |
| sailing boat | 77.748 | | | | |
However, when evaluating with FiftyOne, I get the following:
precision recall f1-score support
cargo 0.76 0.91 0.83 606
ferry 0.97 1.00 0.99 990
fishing boat 0.85 0.96 0.91 332
sailing boat 0.87 0.97 0.92 706
micro avg 0.88 0.97 0.92 2634
macro avg 0.87 0.96 0.91 2634
weighted avg 0.88 0.97 0.92 2634
I was using this script:
results = dataset.evaluate_detections(
    "predictions",
    gt_field="detections",
    compute_mAP=True,
    method="coco",
)
results.print_report()
I was expecting the same precision and recall metrics, since both use COCO-style evaluation. Setting the IoU parameter doesn't produce consistent values either.
I can't understand how to make these metrics match.
Upvotes: 0
Views: 172
Reputation: 111
FiftyOne's evaluate_detections() only performs the evaluation at one specified IoU threshold (the default is 0.5). Precision and recall are shown per class at the top and averaged at the bottom. Micro-averaging assigns equal weight to every example across all classes, while macro-averaging assigns equal weight to each class, regardless of its number of examples. Support is the number of instances used in the evaluation: a 70% recall over only 10 examples means something very different than a 70% recall over 1,000 examples.
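For intuition, here is a toy illustration of the micro vs. macro difference, using made-up counts (not your data):

# Toy example with made-up TP/FP counts (not your data) to show how
# micro- and macro-averaged precision differ.
tp = {"cargo": 550, "ferry": 960}  # hypothetical true positives per class
fp = {"cargo": 170, "ferry": 30}   # hypothetical false positives per class

# Micro: pool all detections across classes, then compute one precision value
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

# Macro: compute precision per class, then average the class values
macro_precision = sum(tp[c] / (tp[c] + fp[c]) for c in tp) / len(tp)

print(micro_precision)  # every detection weighted equally
print(macro_precision)  # every class weighted equally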
To get alignment with the COCO evaluation script, you would want to run FiftyOne's evaluation routine at each IoU threshold from 0.5 to 0.95 (spacing of 0.05) and aggregate the results.
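A minimal sketch of that loop, assuming the dataset, "predictions", and "detections" names from your snippet, and assuming results.report() returns the printed report as a dict (exact keys may vary by FiftyOne version):

import numpy as np

# Sketch only: average FiftyOne's per-class precision/recall over the COCO
# IoU thresholds 0.50:0.95 (step 0.05).
per_class = {}  # class name -> list of (precision, recall), one per threshold

for iou in np.arange(0.50, 1.00, 0.05):
    results = dataset.evaluate_detections(
        "predictions",
        gt_field="detections",
        method="coco",
        iou=float(iou),       # evaluate at this single IoU threshold
        compute_mAP=False,    # the per-threshold report doesn't need mAP
    )
    for cls, stats in results.report().items():
        if "avg" in cls or cls == "accuracy":
            continue  # skip the aggregate rows
        per_class.setdefault(cls, []).append((stats["precision"], stats["recall"]))

for cls, values in per_class.items():
    precisions, recalls = zip(*values)
    print(f"{cls}: precision={np.mean(precisions):.3f}, recall={np.mean(recalls):.3f}")

Even then, expect the values to be close rather than identical, since COCO AP/AR are computed from the full precision-recall curve (and a maxDets limit) rather than from counts at a single confidence threshold.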
Upvotes: 2