Reputation: 587
A lot of popular and state-of-the-art object detection algorithms like YOLO and SSD use the concept of anchor boxes. As far as I understand, for networks like YOLO v3, each output grid cell has multiple anchor boxes with different aspect ratios. For detection, the network predicts offsets for the anchor box with the highest overlap with the given object. Why is this used instead of having multiple bounding box predictors (each predicting x, y, w, h and c)?
Upvotes: 1
Views: 1219
Reputation: 4051
No, anchor boxes cannot be simply replaced by multiple bounding box predictors.
In your description, there was a minor misunderstanding:
"For detection, the network predicts offsets for the anchor box with the highest overlap with the given object"
Selecting the anchor box with the highest overlap with a ground truth box only happens during the training phase, as explained in the SSD paper, section 2.2, Matching Strategy. Not only are the anchor boxes with the highest overlap selected, but also the ones that have an IoU greater than 0.5.
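To make the matching step concrete, here is a minimal sketch in plain Python. It assumes boxes are given as (x_min, y_min, x_max, y_max); the function names `iou` and `match_anchors` are made up for this illustration and do not come from the SSD code base.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """Return {anchor_index: gt_index} for the anchors treated as positives."""
    matches = {}
    # Every anchor whose best IoU with some ground truth exceeds the threshold.
    for ai, anchor in enumerate(anchors):
        best_gi, best_iou = None, threshold
        for gi, gt in enumerate(gt_boxes):
            overlap = iou(anchor, gt)
            if overlap > best_iou:
                best_gi, best_iou = gi, overlap
        if best_gi is not None:
            matches[ai] = best_gi
    # Each ground truth also gets its single best anchor, even below the threshold.
    for gi, gt in enumerate(gt_boxes):
        best_ai = max(range(len(anchors)), key=lambda ai: iou(anchors[ai], gt))
        matches[best_ai] = gi
    return matches


# Hypothetical anchors and ground truth boxes, in pixels.
anchors = [(0, 0, 10, 10), (0, 0, 8, 10), (20, 20, 30, 30), (0, 0, 4, 4)]
gt_boxes = [(0, 0, 10, 10), (22, 22, 40, 40)]
matches = match_anchors(anchors, gt_boxes)
```

Here anchors 0 and 1 both match the first ground truth box because their IoU is above 0.5, while anchor 2 matches the second one only because it is that box's best available anchor.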
At prediction time, the box predictor outputs the four offsets for every anchor box, together with confidences for all categories.
Now we come to the question of why we predict offsets instead of the box attributes (x, y, w, h).
In short, this is related to scale. On this point I agree with @viceriel's answer, but here is a vivid example.
Suppose the following two images of the same size (the left one has a blue background) are fed to the predictor, and we want to get the bbox for the dog. The red bbox in each image represents the anchor box, and both are an almost perfect bbox for the dog. If we predict offsets, the box predictor only needs to predict 0 for all four offsets in both cases. If instead you use multiple predictors that output the box attributes directly, the model has to give two different sets of values for w and h, while x and y are the same. This is essentially what @viceriel explains: predicting offsets presents a less difficult mapping for the predictor to learn.
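The dog example can be written out numerically. The sketch below assumes the centre-size offset parameterization used in Faster R-CNN style detectors (centre shift divided by anchor size, log ratio for width and height); YOLO uses a somewhat different formula, and the box coordinates are invented for illustration.

```python
import math


def encode(box, anchor):
    """Offsets of `box` relative to `anchor`; both are (cx, cy, w, h)."""
    cx, cy, w, h = box
    acx, acy, aw, ah = anchor
    return ((cx - acx) / aw, (cy - acy) / ah, math.log(w / aw), math.log(h / ah))


def decode(offsets, anchor):
    """Inverse of `encode`: turn predicted offsets back into (cx, cy, w, h)."""
    tx, ty, tw, th = offsets
    acx, acy, aw, ah = anchor
    return (acx + tx * aw, acy + ty * ah, aw * math.exp(tw), ah * math.exp(th))


# Hypothetical normalized boxes: same centre, very different sizes.
small_dog = (0.5, 0.5, 0.2, 0.3)
large_dog = (0.5, 0.5, 0.6, 0.8)

# Each image has an anchor that already fits the dog almost perfectly.
small_anchor = (0.5, 0.5, 0.2, 0.3)
large_anchor = (0.5, 0.5, 0.6, 0.8)

offsets_small = encode(small_dog, small_anchor)
offsets_large = encode(large_dog, large_anchor)
```

Both `offsets_small` and `offsets_large` come out as all zeros, so the regression target is identical in the two images, whereas the raw (w, h) targets would be (0.2, 0.3) and (0.6, 0.8). `decode` is what would run at prediction time, once per anchor.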
This example also explains why anchor boxes can help improve a detector's performance.
Upvotes: 2
Reputation: 873
The key is to understand how anchor boxes are created. For example, YOLOv3 takes the sizes of the bounding boxes in the training set, applies k-means to them, and finds box sizes that describe well all the boxes present in the training set.
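As a rough sketch of that clustering step, the code below runs k-means over (w, h) pairs using IoU as the similarity measure, which is the idea described for the YOLO dimension clusters. The initialization, the helper names and the box sizes are assumptions made for this example, not the actual YOLO implementation.

```python
def wh_iou(a, b):
    """IoU of two (w, h) boxes placed on the same centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iters=50):
    """Cluster (w, h) pairs into k anchor sizes; requires k <= len(sizes)."""
    # Naive deterministic init: spread the starting centroids over boxes sorted by area.
    ordered = sorted(sizes, key=lambda s: s[0] * s[1])
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in sizes:
            # Assign each box to the centroid it overlaps most.
            best = max(range(k), key=lambda i: wh_iou(s, centroids[i]))
            clusters[best].append(s)
        # New centroid is the mean width and height; empty clusters keep the old one.
        new = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical training-set box sizes in pixels: three small and three large objects.
sizes = [(10, 12), (12, 10), (11, 11), (50, 60), (60, 50), (55, 55)]
anchor_sizes = kmeans_anchors(sizes, k=2)
```

With this toy data the two resulting anchor sizes sit in the middle of the small group and the large group, so every training box is close to one of them.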
If you predict w and h directly instead of an offset from an anchor box, your possible outputs will be more variable, in the sense that there will be many, many possible heights and widths for a bounding box. If you instead predict an offset for a box that already has an appropriate size for your object detection task, there will be less variability, because the anchor boxes already describe the wanted bounding boxes. This leads to better performance for the network, because you reframe the task and the network now learns a less difficult input-output mapping.
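The reduced variability can be checked with a toy calculation. The sketch below compares the spread of log-widths with the spread of log-width offsets relative to the best-fitting anchor; the box sizes and the two anchors are hypothetical values chosen for illustration, and the log scale is used only so both quantities are in the same units.

```python
import math
import statistics


def wh_iou(a, b):
    """IoU of two (w, h) boxes placed on the same centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


# Hypothetical ground truth sizes and anchor sizes, in pixels.
sizes = [(10, 12), (12, 10), (11, 11), (50, 60), (60, 50), (55, 55)]
anchors = [(11, 11), (55, 55)]


def best_anchor(size):
    """Anchor with the highest IoU for this box size."""
    return max(anchors, key=lambda a: wh_iou(size, a))


# Target if the network regresses the width directly (on a log scale).
raw_targets = [math.log(w) for w, _ in sizes]
# Target if the network regresses the width offset from the best anchor.
offset_targets = [math.log(w / best_anchor((w, h))[0]) for w, h in sizes]

raw_spread = statistics.pstdev(raw_targets)
offset_spread = statistics.pstdev(offset_targets)
```

For this data `offset_spread` is much smaller than `raw_spread`, which is the "less variability" argument in numbers: the anchors absorb the large differences in scale and the network only has to learn small corrections.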
Upvotes: 1