Faster RCNN: how to translate coordinates

Question

I'm trying to understand and use the Faster R-CNN algorithm on my own data.

My question is about ROI coordinates. What we have as labels, and what we want in the end, are ROI coordinates in the input image. However, if I understand correctly, anchor boxes are defined on the convolutional feature map, the ROI regression gives ROI coordinates relative to an anchor box (so easily translatable to feature map coordinates), and the Fast R-CNN part then does the ROI pooling using coordinates in the convolutional feature map, and itself classifies the ROI and regresses the bounding box coordinates.

Considering that some convolutions and poolings occurred between the raw image and the convolutional features, possibly with strides > 1 (subsampling), how do we map coordinates in the raw image to coordinates in feature space, and back?

How are we supposed to specify anchor box sizes: relative to the input image size, or to the convolutional feature map?

How is the bounding box regressed by Fast R-CNN expressed? (I would guess: relative to the ROI proposal, similar to how the proposal is encoded relative to the anchor box, but I'm not sure.)

gdelab · Accepted Answer

This looks like an implementation question; the method itself does not specify it.

A good approach, used by the TensorFlow Object Detection API, is to always express coordinates and ROI sizes relative to the layer's input size. That is, all coordinates and sizes are real numbers between 0 and 1. The same goes for the anchor boxes.
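As a minimal sketch of that idea (not the API's actual code), the helpers below convert a pixel box to fractions of the image size and then scale the same fractions to a feature map grid. The function names, the `(ymin, xmin, ymax, xmax)` ordering, and the example image and feature map sizes are assumptions chosen for illustration.

```python
def normalize_box(box, height, width):
    """Convert (ymin, xmin, ymax, xmax) in pixels to fractions of the input size."""
    ymin, xmin, ymax, xmax = box
    return (ymin / height, xmin / width, ymax / height, xmax / width)


def denormalize_box(box, height, width):
    """Scale a normalized (ymin, xmin, ymax, xmax) box to a grid of the given size."""
    ymin, xmin, ymax, xmax = box
    return (ymin * height, xmin * width, ymax * height, xmax * width)


# Hypothetical sizes: a 600x800 image and a 38x50 feature map (roughly stride 16).
image_hw = (600, 800)
feature_hw = (38, 50)

label = (120.0, 200.0, 360.0, 600.0)            # ground-truth box in pixels
norm = normalize_box(label, *image_hw)          # same box as fractions of the image
in_features = denormalize_box(norm, *feature_hw)  # same box in feature map cells
back = denormalize_box(norm, *image_hw)         # and back to pixels
```

Because the normalized box does not depend on the resolution of any particular layer, the same four numbers can be scaled to the image for labels and outputs, or to the feature map for ROI pooling.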

This neatly handles the downsampling problem and makes ROI coordinates easy to compute.
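On how boxes are expressed relative to anchors: the Faster R-CNN paper parameterizes a box by center offsets scaled by the anchor size and by log ratios of width and height. The sketch below applies that parameterization to anchors laid out in normalized coordinates. The function names, the `(cy, cx, h, w)` layout, and the default anchor sizes and ratios are assumptions for illustration; real implementations often add extra scaling factors that are omitted here.

```python
import math


def make_anchors(grid_h, grid_w, sizes=(0.25, 0.5), ratios=(0.5, 1.0, 2.0)):
    """Anchors as (cy, cx, h, w) in normalized [0, 1] coordinates.

    One set of anchors is centered on every cell of a grid_h x grid_w feature map.
    Ratios are width / height in normalized units, so on a non-square image
    they differ from pixel aspect ratios.
    """
    anchors = []
    for i in range(grid_h):
        for j in range(grid_w):
            cy = (i + 0.5) / grid_h
            cx = (j + 0.5) / grid_w
            for s in sizes:
                for r in ratios:
                    # Keep the area at s * s while changing the aspect ratio.
                    anchors.append((cy, cx, s / math.sqrt(r), s * math.sqrt(r)))
    return anchors


def encode(box, anchor):
    """Regression targets (ty, tx, th, tw) of a box relative to an anchor."""
    cy, cx, h, w = box
    acy, acx, ah, aw = anchor
    return ((cy - acy) / ah, (cx - acx) / aw, math.log(h / ah), math.log(w / aw))


def decode(deltas, anchor):
    """Invert encode(): recover (cy, cx, h, w) from predicted deltas."""
    ty, tx, th, tw = deltas
    acy, acx, ah, aw = anchor
    return (acy + ty * ah, acx + tx * aw, ah * math.exp(th), aw * math.exp(tw))


# Usage: a 2x2 grid with a single square anchor per cell.
anchors = make_anchors(2, 2, sizes=(0.5,), ratios=(1.0,))
gt = (0.3, 0.3, 0.4, 0.6)            # hypothetical ground-truth box, normalized
deltas = encode(gt, anchors[0])
restored = decode(deltas, anchors[0])
```

The same `encode`/`decode` pair can be reused for the second stage by passing the ROI proposal in place of the anchor, which matches the guess in the question that the final box is regressed relative to the proposal.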
