Reputation: 4201
As far as I understand from the paper, SSD tries to predict object locations and their corresponding class scores from several different feature maps.
So each layer can have a different number of predictions, depending on the number of anchor (reference) boxes at different scales.
So if one convolutional feature map has 5 reference boxes, there should be class scores and bounding-box (bbx) coordinates for each of those reference boxes.
We make these predictions by sliding a window (a kernel, e.g. 3×3) over the feature maps of different layers. What I am not clear about is the connection from the sliding window at a given position to the score layer.
1. Is the convolution window output simply connected to the score layer in a fully connected way?
2. Or do we apply some other operation to the convolution window output before connecting it to the score layer?
Reputation: 4201
The class score and bbx predictions are obtained by convolution. This is one of the differences between YOLO and SSD: SSD does not use a fully connected layer for its predictions. Here is how the scores are computed.
The figure above shows an 8×8 (spatial size) feature map in an SSD feature extractor model. For each position in the feature map we predict the following: if we have k default (anchor) boxes, then for each box we predict 4 box offsets and c class scores, i.e. k(4+c) values per position.
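To make the count concrete, here is a small sketch of the arithmetic. The numbers are hypothetical: an 8×8 map, k = 5 default boxes (as in the question) and c = 21 classes (for example 20 PASCAL VOC classes plus background). The function names are made up for illustration.

```python
def predictions_per_position(k, c):
    """Scalars predicted at one feature-map location:
    4 box offsets + c class scores for each of the k default boxes."""
    return k * (4 + c)


def predictions_per_map(m, n, k, c):
    """Scalars predicted over a whole m x n feature map."""
    return m * n * predictions_per_position(k, c)


# Hypothetical example: 8x8 map, k = 5 default boxes, c = 21 classes
print(predictions_per_position(5, 21))   # 125
print(predictions_per_map(8, 8, 5, 21))  # 8000
```

The per-position number, k(4+c), is also the number of output channels (filters) the prediction layer needs.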
Now the tricky part: how do we get those scores?
A set of convolutional filters predicts those (4+c) scalars for each default box.
So for a single feature map with k anchor boxes referenced in the prediction,
**we apply k(4+c) filters (each 3×3 in spatial size) around every location of the feature map in a sliding-window manner.**
The values of those filters are learned during training.
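The sliding-window prediction described above can be sketched as a plain 3×3 convolution with k(4+c) output channels. This is a minimal NumPy sketch under several assumptions: stride 1 with zero padding, a hypothetical 16-channel input, random weights standing in for trained ones, and a channel layout grouped per default box (the original SSD code uses separate localization and confidence layers instead). The helper names are hypothetical. The last line also addresses the first question: at a single position, the convolution is the same as a small dense layer applied to the flattened 3×3×C window, but its weights are shared across all positions.

```python
import numpy as np


def conv3x3_same(x, w, b):
    """Naive 3x3 convolution, stride 1, zero padding 1.
    x: (H, W, C_in) feature map, w: (3, 3, C_in, C_out), b: (C_out,)."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero-pad the spatial dims
    out = np.empty((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]  # 3x3xC_in window centred on (i, j)
            # Each output channel: sum of patch * filter over the window, plus bias
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out


def init_predictor_filters(c_in, k, c, rng):
    """k*(4+c) filters of size 3x3xc_in; random here, learned in practice."""
    n_filters = k * (4 + c)
    w = rng.normal(scale=0.01, size=(3, 3, c_in, n_filters))
    b = np.zeros(n_filters)
    return w, b


def ssd_predictor(x, w, b, k, c):
    """Apply the filters at every location and split the output per default box.
    Assumes channels are ordered [box0: 4+c, box1: 4+c, ...]."""
    H, W, _ = x.shape
    y = conv3x3_same(x, w, b).reshape(H, W, k, 4 + c)
    return y[..., :4], y[..., 4:]  # (H, W, k, 4) offsets, (H, W, k, c) scores


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 16))  # hypothetical 8x8 map, 16 channels
k, c = 5, 21
w, b = init_predictor_filters(16, k, c, rng)
loc, conf = ssd_predictor(feature_map, w, b, k, c)
print(loc.shape, conf.shape)  # (8, 8, 5, 4) (8, 8, 5, 21)

# At position (3, 4) the convolution equals a dense layer on the flattened window
window = np.pad(feature_map, ((1, 1), (1, 1), (0, 0)))[3:6, 4:7, :].reshape(-1)
dense = (window @ w.reshape(-1, w.shape[-1]) + b).reshape(k, 4 + c)
print(np.allclose(dense[:, :4], loc[3, 4]))  # True
```

In a real framework the naive loop would be replaced by a built-in convolution layer; it is written out here only to make the sliding window explicit.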