Benoît Boidin

Reputation: 87

What do the FiftyOne evaluation metrics mean?

I have a dataset I use to test my object detection model, let's say test_dataset.

When evaluating with COCO eval (through YOLOX eval.py script) for a given model, I get this result:

Average forward time: 23.05 ms, Average NMS time: 2.60 ms, Average inference time: 25.65 ms
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.724
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.957
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.831
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.591
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.810
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.755
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.759
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.349
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.649
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.839
per class AP:
| class        | AP     | class   | AP     | class        | AP     |
|:-------------|:-------|:--------|:-------|:-------------|:-------|
| cargo        | 59.491 | ferry   | 87.701 | fishing boat | 67.328 |
| sailing boat | 75.134 |         |        |              |        |
per class AR:
| class        | AR     | class   | AR     | class        | AR     |
|:-------------|:-------|:--------|:-------|:-------------|:-------|
| cargo        | 64.802 | ferry   | 89.717 | fishing boat | 71.506 |
| sailing boat | 77.748 |         |        |              |        |

However, when evaluating with FiftyOne, I get the following:

              precision    recall  f1-score   support

       cargo       0.76      0.91      0.83       606
       ferry       0.97      1.00      0.99       990
fishing boat       0.85      0.96      0.91       332
sailing boat       0.87      0.97      0.92       706

   micro avg       0.88      0.97      0.92      2634
   macro avg       0.87      0.96      0.91      2634
weighted avg       0.88      0.97      0.92      2634

I was using this script:

results = dataset.evaluate_detections(
    "predictions",
    gt_field="detections",
    compute_mAP=True,
    method="coco"
)

results.print_report()

I was expecting the same precision and recall metrics, since both use COCO-style evaluation. Setting the iou parameter of evaluate_detections() doesn't bring the values any closer either.
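For instance, evaluating at a single stricter threshold (same fields as the script above; 0.75 is just an example) still doesn't line up with the COCO table:

# Same evaluation as above, but at one stricter IoU threshold
results_075 = dataset.evaluate_detections(
    "predictions",
    gt_field="detections",
    method="coco",
    iou=0.75,
)

results_075.print_report()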

I can't understand how to make these metrics match.

Upvotes: 0

Views: 172

Answers (1)

Jacob

Reputation: 111

FiftyOne's evaluate_detections() only performs the evaluation at one specified IoU threshold (default 0.5), which is why your FiftyOne numbers sit much closer to the IoU=0.50 row of the COCO output than to the 0.50:0.95 averages. Precision and recall are shown per class at the top of the report and averaged at the bottom. Micro-averaging assigns equal weight to every detection across all classes; macro-averaging assigns equal weight to each class, regardless of how many examples it has. Support is the number of instances used in the evaluation: a 70% recall over only 10 examples means something very different from a 70% recall over 1,000 examples.
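A rough numeric illustration of the difference (made-up counts, not your data):

# Micro: pool true/false positives over all classes, then compute one precision.
# Macro: compute precision per class, then take the unweighted mean.
tp = {"cargo": 90, "ferry": 99}   # hypothetical true positive counts
fp = {"cargo": 30, "ferry": 1}    # hypothetical false positive counts

micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
macro = sum(tp[c] / (tp[c] + fp[c]) for c in tp) / len(tp)

print(f"micro precision: {micro:.3f}")  # 189/220 = 0.859
print(f"macro precision: {macro:.3f}")  # (0.750 + 0.990)/2 = 0.870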

To get alignment with the COCO evaluation script, you would want to run FiftyOne's evaluation routine at each IoU threshold from 0.5 to 0.95 (spacing of 0.05) and aggregate the results.
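A minimal sketch of that loop, reusing the field names from your script; I'm assuming here that results.metrics() returns a dict with micro-averaged "precision" and "recall" keys, so double-check that against your FiftyOne version:

import numpy as np

# Re-run the COCO-style evaluation at each of the ten COCO IoU thresholds
# (0.50, 0.55, ..., 0.95) and average the micro precision/recall across them
iou_threshs = np.arange(0.5, 1.0, 0.05)

precisions, recalls = [], []
for iou in iou_threshs:
    results = dataset.evaluate_detections(
        "predictions",
        gt_field="detections",
        method="coco",
        iou=float(iou),
    )
    metrics = results.metrics()  # assumed to be micro-averaged by default
    precisions.append(metrics["precision"])
    recalls.append(metrics["recall"])

print(f"precision @ IoU 0.50:0.95: {np.mean(precisions):.3f}")
print(f"recall    @ IoU 0.50:0.95: {np.mean(recalls):.3f}")

Keep in mind that COCO's AP also integrates over the precision/recall curve, so the aggregated numbers may still not match to the decimal, but they should land much closer. And since you already pass compute_mAP=True, results.mAP() should give you the aggregated COCO-style mAP over IoU 0.50:0.95 directly.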

Upvotes: 2
