Reputation: 3070
I am reading the paper "Fully Convolutional Networks for Semantic Segmentation" by Jonathan Long*, Evan Shelhamer*, and Trevor Darrell (CVPR 2015 and PAMI 2016), and I want to understand why it works for semantic segmentation. Let us look at the FCN-32s architecture: it consists of two phases, feature extraction (conv1-1 -> pool5) and feature classification (fc6 -> score_fr). Compared with a normal classification network, the main difference is in the second phase. FCN-32s replaces the fully connected layers with 1 x 1 convolution layers in fc7 to retain the spatial map (as stated in the caption of Figure 2 of the paper). Hence, I am confused about this point:
Thank you in advance.
Update: This is a figure showing how to convert from fully connected to fully convolutional layers
Upvotes: 3
Views: 1099
Reputation: 114786
If you look at the math, a "Convolution" layer and an "InnerProduct" (aka "fully connected") layer are basically quite similar: both perform a linear operation on their respective receptive fields. The only difference is that "InnerProduct" takes the entire input as its "receptive field", while a "Convolution" layer only looks at a kernel_size window in the input.
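To make that equivalence concrete, here is a minimal NumPy sketch (the shapes are toy values chosen for illustration, not the paper's actual dimensions): a fully connected layer with K outputs gives exactly the same numbers as K convolution kernels whose window covers the whole input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: 3 channels, 4x4 spatial map (e.g. the output of a pooling layer).
C, H, W = 3, 4, 4
x = rng.standard_normal((C, H, W))

# An "InnerProduct" (fully connected) layer with K outputs:
# its receptive field is the entire input.
K = 5
w_fc = rng.standard_normal((K, C * H * W))
fc_out = w_fc @ x.ravel()                       # shape (K,)

# The same weights viewed as K convolution kernels of size C x H x W.
w_conv = w_fc.reshape(K, C, H, W)

# A "valid" convolution whose kernel covers the whole input produces a
# 1x1 spatial output per filter -- numerically identical to the FC layer.
conv_out = np.einsum('chw,kchw->k', x, w_conv)  # shape (K,)

print(np.allclose(fc_out, conv_out))            # True
```

This is the same weight-reshaping trick the paper uses to "convolutionalize" fc6 and fc7.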
What happens if the input size is changed? A "Convolution" layer could not care less: it simply outputs a feature map whose spatial dimensions correspond to the new input shape. An "InnerProduct" layer, on the other hand, fails, because the number of weights it has no longer matches the size of its receptive field.
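A small NumPy demonstration of that asymmetry (again with made-up toy shapes): the same 3x3 convolution kernels handle two different input sizes, while a weight matrix sized for one input shape raises an error on another.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d_valid(x, kernels):
    """Plain 'valid' cross-correlation: x is (C, H, W), kernels is (K, C, kh, kw)."""
    C, H, W = x.shape
    K, _, kh, kw = kernels.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            out[:, i, j] = np.einsum('chw,kchw->k', x[:, i:i + kh, j:j + kw], kernels)
    return out

kernels = rng.standard_normal((8, 3, 3, 3))     # 8 filters, 3-channel 3x3 window

# The convolution adapts: its output spatial size follows the input.
small = conv2d_valid(rng.standard_normal((3, 7, 7)), kernels)
large = conv2d_valid(rng.standard_normal((3, 12, 9)), kernels)
print(small.shape, large.shape)                 # (8, 5, 5) (8, 10, 7)

# An FC layer sized for a 3x7x7 input cannot consume a 3x12x9 one:
w_fc = rng.standard_normal((8, 3 * 7 * 7))
try:
    w_fc @ rng.standard_normal((3, 12, 9)).ravel()
    fc_failed = False
except ValueError:
    fc_failed = True
print(fc_failed)                                # True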
Replacing the top fully-connected layers of a model with "Convolution" layers allows for "sliding window" classification of the image, thus achieving coarse semantic segmentation: a label per (coarse) output position rather than a single label per image.
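The "sliding window" effect can be sketched as follows (toy shapes, written as an explicit loop for clarity; a real network computes the same thing in one convolutional forward pass): a classifier head trained on 4x4 feature maps, applied at every 4x4 window of a 10x10 map, yields a 7x7 grid of class scores, and the per-position argmax is a coarse label map.

```python
import numpy as np

rng = np.random.default_rng(2)

# A classifier head trained on C x H x W = 3 x 4 x 4 feature maps, K = 5 classes.
K, C, H, W = 5, 3, 4, 4
w_fc = rng.standard_normal((K, C * H * W))

# Apply the "convolutionalized" head to a larger 3 x 10 x 10 feature map:
# sliding the 4x4 window gives a 7x7 grid of class-score vectors.
big = rng.standard_normal((C, 10, 10))
scores = np.empty((K, 10 - H + 1, 10 - W + 1))
for i in range(scores.shape[1]):
    for j in range(scores.shape[2]):
        window = big[:, i:i + H, j:j + W]
        scores[:, i, j] = w_fc @ window.ravel()  # original classifier on this window

# Argmax over classes yields one label per spatial position:
# a coarse segmentation map from a single forward pass.
labels = scores.argmax(axis=0)
print(scores.shape, labels.shape)                # (5, 7, 7) (7, 7)
```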
There is still the big issue of the scale gap between the input resolution and the coarse resolution of the output labels, but there are "Deconvolution" layers to bridge that gap.
Upvotes: 6