Reputation: 1403
This is the architecture of YOLO. I am trying to calculate the output size of each layer myself, but I can't get the size as described in the paper.
For example, the first conv layer takes a 448x448 input and uses a 7x7 filter with stride 2. But according to the equation W2 = (W1 − F + 2P)/S + 1 = (448 − 7 + 0)/2 + 1 = 221.5, the result is not an integer, so the filter size seems to be unsuitable for the input size.
Can anyone explain this? Did I miss something or misunderstand the YOLO architecture?
Upvotes: 2
Views: 1727
Reputation: 36
As Hawx Won said, the input image is padded with 3 extra pixels on each side. Here is how that follows from the source code.
For convolutional layers, if pad is enabled, the padding value of each layer is calculated by:
// In parser.c
if(pad) padding = size/2;
// In convolutional_layer.c
l.pad = padding;
Here size is the width of the (square) filter, and the division is C integer division, so the result is truncated.
So, for the first layer: padding = size/2 = 7/2 = 3
Then the output of the first convolutional layer is:
output_w = (input_w+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224
output_h = (input_h+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224
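To check other layers the same way, the two snippets above can be combined into one small helper. This is only a sketch in Python, not darknet code: the function name `conv_output_size` and its parameters are made up for illustration, and `//` stands in for C integer division (the two agree here because all values are non-negative).

```python
def conv_output_size(input_size, size, stride, pad=True):
    """Output width/height of a conv layer, following the formula quoted above.

    Hypothetical helper: the name and signature are not from darknet.
    """
    # Mirrors `if(pad) padding = size/2;` with truncating integer division.
    padding = size // 2 if pad else 0
    # Mirrors `(input_w + 2*pad - size)/stride + 1`.
    return (input_size + 2 * padding - size) // stride + 1


if __name__ == "__main__":
    # First YOLO layer: 448x448 input, 7x7 filter, stride 2, pad enabled.
    print(conv_output_size(448, 7, 2))             # 224
    # Without padding, integer division truncates 220.5 to 220, giving 221.
    print(conv_output_size(448, 7, 2, pad=False))  # 221
```

With pad disabled the formula still returns an integer, but only because the division truncates, which is why the hand calculation in the question ends up at 221.5 instead.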
Upvotes: 2
Reputation: 21
Well, I spent some time reading the source code and learned that the input image is padded with 3 extra pixels on the top, bottom, left, and right, so the image size becomes 448 + 2x3 = 454. The output size with valid padding is then Output_size = ceil((W-F+1)/S) = ceil((454-7+1)/2) = 224. Therefore, the output of the first layer is 224x224x64.
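As a quick sanity check, here is a small Python sketch of that calculation. The function name `valid_output_size` is just an illustrative choice, not something from the YOLO source. It also compares the ceil form against the more common floor form, (W-F)/S + 1 with integer division, over a range of padded sizes.

```python
import math


def valid_output_size(padded_size, filter_size, stride):
    """Output size of a 'valid' convolution over an already padded input.

    Hypothetical helper for illustration only.
    """
    return math.ceil((padded_size - filter_size + 1) / stride)


if __name__ == "__main__":
    padded = 448 + 2 * 3                      # 454 after adding 3 pixels per side
    print(valid_output_size(padded, 7, 2))    # 224

    # The ceil form and the floor form give the same result for integer sizes.
    assert all(
        valid_output_size(w, 7, 2) == (w - 7) // 2 + 1
        for w in range(7, 500)
    )
```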
I hope this helps.
Upvotes: 2