Reputation: 81
Can someone please explain to me why the width of the VGG16 network is 64 in the first convolutional layer? I understand that the number of filters doubles as you go deeper into the network, but I am not sure how 64 was determined in the first place.
Upvotes: 3
Views: 1143
Reputation: 178
The input to the first convolutional layer in VGG16 is an image of size 224x224x3. The output volume of that layer has shape 224x224x64. The value 64 is the depth (or number of channels; the paper calls it width, which is confusing IMO) of the volume produced by the convolution. Each of the 64 filters has size 3x3x3, spanning all 3 input channels, and produces one 224x224 activation map; stacking the 64 maps gives the output volume its depth of 64. The choice of 64 filters in conv1_1 was a design decision that the authors don't explain, but it is related to managing the number of trainable parameters.
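To make the shapes concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's original code) of conv1_1. It shows the 224x224x64 output volume and the layer's parameter count:

```python
import torch
import torch.nn as nn

# conv1_1 in VGG16: 64 filters of size 3x3 applied to a 3-channel input,
# with padding=1 so the 224x224 spatial size is preserved.
conv1_1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)   # one 224x224 RGB image
y = conv1_1(x)

print(y.shape)  # torch.Size([1, 64, 224, 224]) -> depth ("width") of 64

# Each filter is 3x3x3 plus one bias, so the layer has
# 64 * (3*3*3) + 64 = 1792 trainable parameters.
print(sum(p.numel() for p in conv1_1.parameters()))  # 1792
```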
The doubling of the number of filters (64, 128, 256 ...) is also a design decision. A common rule of thumb is to scale the number of filters by the inverse of the downsampling factor of the pooling layer. In the VGG16 architecture they use 2x2 max pooling with a stride of 2, which halves the width and height of the input volume according to this equation (width and height are equal):
W_out = (W_in − FilterSize + 2·Padding) / Stride + 1
For VGG16 pool1 (2x2 filter, stride 2, no padding):
W_out = (224 − 2 + 2·0) / 2 + 1 = 112
Pooling halves the spatial size (224 → 112), so the number of filters is doubled in the next convolutional stage (64 → 128).
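As a sanity check, here is a small Python sketch (again my own, assuming the standard VGG16 configuration) that applies the formula above at every pooling stage and shows the spatial size halving while the filter count doubles until it caps at 512:

```python
def pool_output_size(w_in, filter_size=2, padding=0, stride=2):
    """W_out = (W_in - F + 2*P) / S + 1, here for 2x2 max pooling with stride 2."""
    return (w_in - filter_size + 2 * padding) // stride + 1

# VGG16 conv stages use 64, 128, 256, 512, 512 filters; each pooling layer
# halves the spatial size, and the filter count doubles until it reaches 512.
width = 224
filters = [64, 128, 256, 512, 512]
for stage, f in enumerate(filters, start=1):
    print(f"conv stage {stage}: {width}x{width}x{f}")
    width = pool_output_size(width)
print(f"after pool5: {width}x{width}x512 feeds the fully connected layers")
```

Running this prints 224x224x64, 112x112x128, 56x56x256, 28x28x512, 14x14x512, and finally 7x7x512, matching the architecture table in the paper.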
Upvotes: 2