Reputation: 1049
I'm reading the book Deep Learning with Python by Francois Chollet. On page 128, the author discusses the problem of stacking consecutive Conv2D layers without max-pooling layers in between. My question is about the following paragraph: I don't understand where the 7 × 7 comes from.
It isn’t conducive to learning a spatial hierarchy of features. The 3 × 3 windows in the third layer will only contain information coming from 7 × 7 windows in the initial input. The high-level patterns learned by the convnet will still be very small with regard to the initial input, which may not be enough to learn to classify digits (try recognizing a digit by only looking at it through windows that are 7 × 7 pixels!). We need the features from the last convolution layer to contain information about the totality of the input.
Layer (type)         Output Shape          Param #
conv2d_4 (Conv2D)    (None, 26, 26, 32)    320
conv2d_5 (Conv2D)    (None, 24, 24, 64)    18496
conv2d_6 (Conv2D)    (None, 22, 22, 64)    36928
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
Upvotes: 3
Views: 773
Reputation: 745
I assume that your CNN architecture uses only 3*3 kernels.
With a 3*3 kernel, the first layer creates feature maps from your input. Each pixel of these feature maps depends only on a 3*3 square of the input. Then the second layer does exactly the same thing, taking the first-layer feature maps as input. So now one pixel depends on a 3*3 square of a feature map, which in turn depends on a 5*5 square of the input.
Doing that a third time, a pixel of a third-layer feature map depends only on a 7*7 window of the input. Each extra 3*3 layer (with stride 1) widens the window by 2 pixels: 3, then 5, then 7.
Here is a 1D example:
      *          # third layer pixel
    | | |
    * * *        # second layer pixels
  | | | | |
  * * * * *      # first layer pixels
| | | | | | |
* * * * * * *    # input pixels --> a single third-layer pixel depends on only 7 input pixels
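The same counting can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not something taken from the book: the function name `receptive_field` and the `(kernel_size, stride)` layer description are made up for this example, and it assumes no dilation.

```python
def receptive_field(layers):
    """Return how many input pixels (along one axis) a single output unit sees.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the input
    side to the output side. Hypothetical helper; assumes no dilation.
    """
    size = 1  # a unit trivially "sees" itself before any layer is applied
    jump = 1  # spacing, in input pixels, between neighbouring units of the current layer
    for kernel_size, stride in layers:
        # Each layer adds (kernel_size - 1) neighbours, spaced `jump` input pixels apart.
        size += (kernel_size - 1) * jump
        # Strided layers (e.g. pooling) spread the units of the next layer further apart.
        jump *= stride
    return size


# Three stacked 3*3 convolutions with stride 1, as in the question.
three_convs = [(3, 1), (3, 1), (3, 1)]
print(receptive_field(three_convs))  # 7

# Same convolutions with a 2*2, stride-2 pooling step after the first two.
convs_with_pooling = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(convs_with_pooling))  # 18
```

Under these assumptions, inserting 2*2 pooling steps makes the window grow much faster (18 instead of 7 input pixels per axis), which is the point the quoted paragraph is making.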
Upvotes: 6