Reputation: 5087
Some object detection frameworks, such as SSD (Single Shot MultiBox Detector) and Faster R-CNN, use “convolutional filters” for classification and regression. The following is from the SSD paper:
For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value.
My question is: does the number of “small kernels” have to be p? What about setting it to an arbitrary number k (different from the number of feature channels)?
Upvotes: 4
Views: 269
Reputation: 17201
In the figure, the part labeled Extra Feature Layers shows how the small kernel extracts a vector of length p at each output location; that vector holds the predictions for the different aspect ratios and class categories.
For example, for the first convolutional feature map, p is 3x(classes+4), and for the second one it is 6x(classes+4). The numbers 3 and 6 are the numbers of anchor boxes defined for those feature maps, and each anchor box produces classes + 4 outputs: one score per class plus 4 box coordinates.
So you need to fix p based on the number of anchor boxes you choose for each feature map and the number of classes you want to detect.
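As a rough illustration of that rule, the output channel count can be computed directly from the two choices. The function name and the value of 21 classes are assumptions made for this sketch, not something taken from the SSD code:

```python
def head_out_channels(num_anchors: int, num_classes: int) -> int:
    """Output channels (number of 3x3 kernels) for one prediction layer."""
    # Each anchor box gets one score per class plus 4 box coordinates.
    return num_anchors * (num_classes + 4)


# Assumed for illustration: 21 classes (e.g. 20 object classes plus background).
NUM_CLASSES = 21
first_map = head_out_channels(3, NUM_CLASSES)   # 3 anchor boxes per location
second_map = head_out_channels(6, NUM_CLASSES)  # 6 anchor boxes per location
print(first_map, second_map)
```

Changing either the number of anchor boxes or the number of classes changes the result, which is why this value cannot be picked independently of them.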
My question is: does the number of “small kernels” have to be p? What about setting it to an arbitrary number k (different from the number of feature channels)?
The output feature channels are the result of convolving with the 3x3xp kernel, so the output always has p channels, where p is the output channel size of the kernel. Note that 3x3xp is actually 3 x 3 x in_channels x p. For example, the first prediction layer is obtained by convolving the 38x38x512 feature map from VGG with a 3x3x512xp kernel to get a 38x38xp output.
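The shape bookkeeping can be checked with a small NumPy sketch. This is not SSD's implementation: the helper `conv3x3_same` is hypothetical, the weights are random, and p = 75 assumes 3 anchor boxes and 21 classes. Note that the SSD quote uses p for the input channels of the feature layer, while this answer uses p for the output channels, so the code names them `in_channels` and `out_channels` to keep them apart.

```python
import numpy as np


def conv3x3_same(feature_map, kernel):
    """3x3 convolution with stride 1 and zero padding 1.

    feature_map: (H, W, in_channels)
    kernel:      (3, 3, in_channels, out_channels)
    returns:     (H, W, out_channels)
    """
    h, w, in_channels = feature_map.shape
    kh, kw, k_in, out_channels = kernel.shape
    assert (kh, kw) == (3, 3) and k_in == in_channels
    # Zero-pad height and width by 1 so the spatial size is preserved.
    padded = np.pad(feature_map, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, out_channels), dtype=feature_map.dtype)
    # Sum the contribution of each of the 9 kernel positions.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] @ kernel[dy, dx]
    return out


rng = np.random.default_rng(0)
features = rng.standard_normal((38, 38, 512))   # 38x38x512 map from VGG
kernel = rng.standard_normal((3, 3, 512, 75))   # assumed p = 3 * (21 + 4) = 75
out = conv3x3_same(features, kernel)
print(out.shape)
```

The last kernel dimension alone sets the number of output channels. It does not have to match `in_channels`, but the layers that decode the predictions expect it to equal anchors x (classes + 4).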
Upvotes: 2