Reputation: 42
I was reading the VGG16 paper, "Very Deep Convolutional Networks for Large-Scale Image Recognition".
In Section 3.2 TESTING, it says that all the fully-connected layers are replaced by convolutional layers:
Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled)
So the architecture of VGG16 (configuration D) when predicting on the test set will be:
input=(224, 224)
conv2d(64, (3,3))
conv2d(64, (3,3))
Maxpooling(2, 2)
conv2d(128, (3,3))
conv2d(128, (3,3))
Maxpooling(2, 2)
conv2d(256, (3,3))
conv2d(256, (3,3))
conv2d(256, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
Dense(4096) is replaced by conv2d((7, 7))
Dense(4096) is replaced by conv2d((1, 1))
Dense(1000) is replaced by conv2d((1, 1))
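Here is how I imagine the converted head in Keras (a minimal sketch; the channel widths 4096/4096/1000 are my assumption that the conv layers keep the widths of the FC layers they replace, which is part of what I'm asking):

# Minimal sketch of my understanding of the FC -> conv conversion.
# Assumption: the converted conv layers keep the widths of the FC
# layers they replace (4096, 4096, 1000).
from tensorflow.keras import Input, Model, layers

feat = Input(shape=(None, None, 512))  # feature map after the last max pool
x = layers.Conv2D(4096, (7, 7), activation='relu')(feat)  # was Dense(4096)
x = layers.Conv2D(4096, (1, 1), activation='relu')(x)     # was Dense(4096)
scores = layers.Conv2D(1000, (1, 1))(x)                   # was Dense(1000)
head = Model(feat, scores)  # output: a (h, w, 1000) class score map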
So is this architecture used only on the test set?
Do the last 3 conv layers all have 1000 channels?
The result is a class score map with the number of channels equal to the number of classes
Since the input size is 224×224, the output size after the last max-pooling layer will be (7, 7). Why does the paper say a variable spatial resolution? I know it does multi-scale evaluation, but the image is cropped to (224, 224) before being fed in.
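Here is the size arithmetic as I understand it (assuming the convs are 'same'-padded, so only the five 2×2 max pools change the spatial size):

size = 224
for _ in range(5):   # the five max-pool layers each halve the spatial size
    size //= 2
print(size)          # 7 -> the 7x7 conv then yields a 1x1x1000 score map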
And how does VGG16 get a (1000,) vector? What does spatially averaged (sum-pooled) mean here? Does it just add a sum-pooling layer of size (7, 7) to get a (1, 1, 1000) array?
the class score map is spatially averaged (sum-pooled)
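In code, is the spatial averaging just this? (a NumPy sketch; the 7×7×1000 shape is what I would expect for a larger-than-224 input, the actual shape depends on the image size):

import numpy as np

score_map = np.random.randn(7, 7, 1000)      # hypothetical class score map
class_scores = score_map.mean(axis=(0, 1))   # average over the spatial dims
print(class_scores.shape)                    # (1000,)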
In Section 3.2 TESTING:
Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
So will multi-crop and dense evaluation be used only on the validation set?
Let's say the input size is (256, 256). Multi-crop might take a (224, 224) crop, where the position of the crop may differ, say rows/columns 0–223 vs. 1–224. Is my understanding of multi-crop correct?
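In other words, something like this (a NumPy sketch with offsets I made up; the paper's actual crop grid may differ):

import numpy as np

img = np.zeros((256, 256, 3))                 # a test image after rescaling
offsets = [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]  # corners + centre
crops = [img[y:y + 224, x:x + 224] for (y, x) in offsets]
print(crops[0].shape)                         # (224, 224, 3)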
And what is dense evaluation? I have tried googling it, but cannot find relevant results.
Upvotes: 1
Views: 1233
Reputation: 461
The main idea of changing the dense layers to convolutional layers is to make inference independent of the input image size. Suppose you have a (224, 224) image; then a network with FC layers will work nicely, but as soon as the image size changes, the network will start throwing size-mismatch errors (which means the network is input-size dependent).
Hence, to counter this, a fully convolutional network is built in which the class evidence is kept in the channels while the spatial dimensions are reduced, using an average-pooling layer (or even further convolutional steps), down to (channels = number_of_classes, 1, 1). So when you flatten this last output, you get a vector of length number_of_classes = channels * 1 * 1.
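As a minimal sketch (a made-up tiny network in Keras, not the actual VGG16), note how Input(shape=(None, None, 3)) plus global average pooling makes the classifier work for any input size:

import numpy as np
from tensorflow.keras import Input, Model, layers

num_classes = 1000                         # e.g. ImageNet
inp = Input(shape=(None, None, 3))         # any spatial size
x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inp)
x = layers.Conv2D(num_classes, (1, 1))(x)  # per-location class scores
out = layers.GlobalAveragePooling2D()(x)   # spatial average -> (num_classes,)
model = Model(inp, out)

model.predict(np.zeros((1, 224, 224, 3)))  # works
model.predict(np.zeros((1, 320, 320, 3)))  # also works, with the same weights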
I am not attaching complete code for this (beyond the minimal sketch above), because your questions would need more detailed answers covering a lot of basics. I encourage you to read up on fully convolutional networks (FCNs) to get the idea. It's easy, and I am 100% sure you will understand the nitty-gritty of it.
Upvotes: 1