Difference between box coordinate and anchor boxes in Keras

Question

I am trying to understand why we need both anchor boxes and box coordinates.

My understanding of SSD so far is that it gives two outputs: the class scores and the bounding box coordinates. My understanding of anchor boxes is that they generate bounding boxes of different aspect ratios, and non-maximum suppression (NMS) is then applied to keep the good ones. I thought that anchor boxes and box coordinates were the same thing. But why does this code have three outputs: class scores, box coordinates, and anchor boxes? More specifically, what does the anchor box layer return? Does it return the set of all bounding boxes of different aspect ratios? If so, how is that different from the box coordinates? Maybe I am misunderstanding anchor boxes. Do the anchor boxes act like a region proposal network, with the box coordinates returning the best boxes from the list of anchor boxes?

My main confusion here is the difference between anchors_concat and boxes_concat.

I am trying to understand the code from:

https://github.com/lvaleriu/ssd_keras-1/blob/master/keras_ssd7.py

# Build the convolutional predictor layers on top of conv layers 4, 5, 6, and 7.
# We build two predictor layers on top of each of these layers: One for class prediction (classification), one for box coordinate prediction (localization)
# We predict `n_classes` confidence values for each box, hence the `classes` predictors have depth `n_boxes * n_classes`
# We predict 4 box coordinates for each box, hence the `boxes` predictors have depth `n_boxes * 4`
# Output shape of `classes`: `(batch, height, width, n_boxes * n_classes)`
classes4 = Conv2D(n_boxes[0] * n_classes, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='classes4')(conv4)
classes5 = Conv2D(n_boxes[1] * n_classes, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='classes5')(conv5)
classes6 = Conv2D(n_boxes[2] * n_classes, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='classes6')(conv6)
classes7 = Conv2D(n_boxes[3] * n_classes, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='classes7')(conv7)
# Output shape of `boxes`: `(batch, height, width, n_boxes * 4)`
boxes4 = Conv2D(n_boxes[0] * 4, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='boxes4')(conv4)
boxes5 = Conv2D(n_boxes[1] * 4, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='boxes5')(conv5)
boxes6 = Conv2D(n_boxes[2] * 4, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='boxes6')(conv6)
boxes7 = Conv2D(n_boxes[3] * 4, (3, 3), strides=(1, 1), padding="valid", kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg), name='boxes7')(conv7)

# Generate the anchor boxes
# Output shape of `anchors`: `(batch, height, width, n_boxes, 8)`
anchors4 = AnchorBoxes(img_height, img_width, this_scale=scales[0], next_scale=scales[1], aspect_ratios=aspect_ratios[0], two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[0], this_offsets=offsets[0], limit_boxes=limit_boxes, variances=variances, coords=coords, normalize_coords=normalize_coords, name='anchors4')(boxes4)
anchors5 = AnchorBoxes(img_height, img_width, this_scale=scales[1], next_scale=scales[2], aspect_ratios=aspect_ratios[1], two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[1], this_offsets=offsets[1], limit_boxes=limit_boxes, variances=variances, coords=coords, normalize_coords=normalize_coords, name='anchors5')(boxes5)
anchors6 = AnchorBoxes(img_height, img_width, this_scale=scales[2], next_scale=scales[3], aspect_ratios=aspect_ratios[2], two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[2], this_offsets=offsets[2], limit_boxes=limit_boxes, variances=variances, coords=coords, normalize_coords=normalize_coords, name='anchors6')(boxes6)
anchors7 = AnchorBoxes(img_height, img_width, this_scale=scales[3], next_scale=scales[4], aspect_ratios=aspect_ratios[3], two_boxes_for_ar1=two_boxes_for_ar1, this_steps=steps[3], this_offsets=offsets[3], limit_boxes=limit_boxes, variances=variances, coords=coords, normalize_coords=normalize_coords, name='anchors7')(boxes7)

# Reshape the class predictions, yielding 3D tensors of shape `(batch, height * width * n_boxes, n_classes)`
# We want the classes isolated in the last axis to perform softmax on them
classes4_reshaped = Reshape((-1, n_classes), name='classes4_reshape')(classes4)
classes5_reshaped = Reshape((-1, n_classes), name='classes5_reshape')(classes5)
classes6_reshaped = Reshape((-1, n_classes), name='classes6_reshape')(classes6)
classes7_reshaped = Reshape((-1, n_classes), name='classes7_reshape')(classes7)
# Reshape the box coordinate predictions, yielding 3D tensors of shape `(batch, height * width * n_boxes, 4)`
# We want the four box coordinates isolated in the last axis to compute the smooth L1 loss
boxes4_reshaped = Reshape((-1, 4), name='boxes4_reshape')(boxes4)
boxes5_reshaped = Reshape((-1, 4), name='boxes5_reshape')(boxes5)
boxes6_reshaped = Reshape((-1, 4), name='boxes6_reshape')(boxes6)
boxes7_reshaped = Reshape((-1, 4), name='boxes7_reshape')(boxes7)
# Reshape the anchor box tensors, yielding 3D tensors of shape `(batch, height * width * n_boxes, 8)`
anchors4_reshaped = Reshape((-1, 8), name='anchors4_reshape')(anchors4)
anchors5_reshaped = Reshape((-1, 8), name='anchors5_reshape')(anchors5)
anchors6_reshaped = Reshape((-1, 8), name='anchors6_reshape')(anchors6)
anchors7_reshaped = Reshape((-1, 8), name='anchors7_reshape')(anchors7)

# Concatenate the predictions from the different layers and the associated anchor box tensors
# Axis 0 (batch) and axis 2 (n_classes or 4, respectively) are identical for all layer predictions,
# so we want to concatenate along axis 1
# Output shape of `classes_concat`: (batch, n_boxes_total, n_classes)
classes_concat = Concatenate(axis=1, name='classes_concat')([classes4_reshaped, classes5_reshaped, classes6_reshaped, classes7_reshaped])

# Output shape of `boxes_concat`: (batch, n_boxes_total, 4)
boxes_concat = Concatenate(axis=1, name='boxes_concat')([boxes4_reshaped, boxes5_reshaped, boxes6_reshaped, boxes7_reshaped])

# Output shape of `anchors_concat`: (batch, n_boxes_total, 8)
anchors_concat = Concatenate(axis=1, name='anchors_concat')([anchors4_reshaped, anchors5_reshaped, anchors6_reshaped, anchors7_reshaped])

# The box coordinate predictions will go into the loss function just the way they are,
# but for the class predictions, we'll apply a softmax activation layer first
classes_softmax = Activation('softmax', name='classes_softmax')(classes_concat)

# Concatenate the class and box coordinate predictions and the anchors to one large predictions tensor
# Output shape of `predictions`: (batch, n_boxes_total, n_classes + 4 + 8)
predictions = Concatenate(axis=2, name='predictions')([classes_softmax, boxes_concat, anchors_concat])

Danny Fang · Accepted Answer

The key is to understand how bounding box regression works in object detection. In bounding box regression, what the model predicts is the OFFSET of the predicted box with respect to an anchor box (or proposal box). Anchor boxes and proposal boxes serve a similar function, but they are generated in different ways. Anchor boxes serve as references for the final predicted boxes (which is possibly why they are called anchor boxes).

As shown in the figure above, the model's output is the offset Delta(x1, y1, x2, y2). Given this offset together with the anchor box, the coordinates of the predicted box can be calculated.

So boxes_concat is actually the model's offset prediction. Combined with anchors_concat, it yields the final bounding box coordinates. This is what the decoding function for the above model's predictions does. See here.

y_pred (array): The prediction output of the SSD model, expected to be a Numpy array
        of shape `(batch_size, #boxes, #classes + 4 + 4 + 4)`, where `#boxes` is the total number of
        boxes predicted by the model per image and the last axis contains
        `[one-hot vector for the classes, 4 predicted coordinate offsets, 4 anchor box coordinates, 4 variances]`.

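As the docstring above shows, boxes_concat contains the 4 predicted coordinate offsets, while the 8 values per box in anchors_concat are the 4 anchor box coordinates followed by the 4 variances.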
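To make the layout of the last axis concrete, here is a minimal sketch that slices one prediction row into its four parts. The helper name `split_prediction` and the numbers in `row` are made up for illustration and are not taken from the repository; only the ordering (class scores, 4 offsets, 4 anchor coordinates, 4 variances) comes from the docstring quoted above.

```python
def split_prediction(row, n_classes):
    """Split one row of the last axis of `predictions` into its parts.

    Hypothetical helper for illustration; the ordering follows the docstring:
    [class scores, 4 offsets, 4 anchor coordinates, 4 variances].
    """
    class_scores = row[:n_classes]                    # from classes_softmax
    offsets = row[n_classes:n_classes + 4]            # from boxes_concat
    anchor = row[n_classes + 4:n_classes + 8]         # first half of anchors_concat
    variances = row[n_classes + 8:n_classes + 12]     # second half of anchors_concat
    return class_scores, offsets, anchor, variances


# Illustrative values only: 3 classes, so the row has 3 + 4 + 8 = 15 entries.
row = [0.1, 0.7, 0.2,              # class scores
       0.5, -0.5, 0.0, 0.2,        # predicted offsets
       100.0, 100.0, 50.0, 80.0,   # anchor box coordinates
       0.1, 0.1, 0.2, 0.2]         # variances

scores, offsets, anchor, variances = split_prediction(row, n_classes=3)
print(offsets)  # the part learned by the network
print(anchor)   # the fixed reference box the offsets are relative to
```

The anchor and variance entries are not learned. They are carried along in the output tensor so that the offsets can be decoded later without recomputing the anchors.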

If you wonder how an anchor box together with an offset can be used to calculate the bounding box, here it is. This method dates back to the famous R-CNN paper (see Appendix C, Bounding Box Regression).
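As a rough sketch of that calculation, the function below applies the R-CNN style transform to a single box. It assumes the centroid format (cx, cy, w, h) for both the anchor and the result, and it assumes the variances simply scale the offsets before decoding. The name `decode_box` is hypothetical, and the repository's own decoder may differ in detail, for example when a different `coords` setting is used.

```python
import math


def decode_box(offsets, anchor, variances=(1.0, 1.0, 1.0, 1.0)):
    """Turn predicted offsets plus an anchor box into absolute coordinates.

    Hypothetical helper. Assumes centroid format (cx, cy, w, h) and that
    each offset is multiplied by its variance before being applied.
    """
    dx, dy, dw, dh = offsets
    a_cx, a_cy, a_w, a_h = anchor
    v_cx, v_cy, v_w, v_h = variances

    # The center is shifted by a fraction of the anchor's width and height.
    cx = dx * v_cx * a_w + a_cx
    cy = dy * v_cy * a_h + a_cy
    # The size is scaled exponentially, so it always stays positive.
    w = math.exp(dw * v_w) * a_w
    h = math.exp(dh * v_h) * a_h
    return cx, cy, w, h


anchor = (100.0, 100.0, 50.0, 80.0)

# Zero offsets reproduce the anchor box itself.
print(decode_box((0.0, 0.0, 0.0, 0.0), anchor))

# Non-zero offsets move the center relative to the anchor's size.
print(decode_box((0.5, -0.5, 0.0, 0.0), anchor))
```

This is why the two tensors are not interchangeable: the offsets alone say nothing about where a box is in the image, and the anchors alone are the same for every input image.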
