CNN: stacking Conv2D layers without max pooling, a question from reading the Keras book

Question

I'm reading Deep Learning with Python by François Chollet. On page 128, the author discusses the problem with stacking consecutive Conv2D layers without max-pooling layers in between. My question is about the following paragraph: I don't understand where the 7 × 7 comes from.

It isn’t conducive to learning a spatial hierarchy of features. The 3 × 3 windows in the third layer will only contain information coming from 7 × 7 windows in the initial input. The high-level patterns learned by the convnet will still be very small with regard to the initial input, which may not be enough to learn to classify digits (try recognizing a digit by only looking at it through windows that are 7 × 7 pixels!). We need the features from the last convolution layer to contain information about the totality of the input.

Layer (type)                 Output Shape              Param #
================================================================
conv2d_4 (Conv2D)            (None, 26, 26, 32)        320
________________________________________________________________
conv2d_5 (Conv2D)            (None, 24, 24, 64)        18496
________________________________________________________________
conv2d_6 (Conv2D)            (None, 22, 22, 64)        36928
================================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0

Pierre-Nicolas Piquin · Accepted Answer

I assume that your CNN architecture only uses 3*3 kernels.
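As a quick sanity check of that assumption, the short sketch below recomputes the output shapes and parameter counts from the summary in the question. It assumes a 28 x 28 x 1 input (MNIST-style; the input shape is not shown in the summary), stride 1 and no padding. The helper name `conv2d_summary` is made up for this illustration.

```python
def conv2d_summary(input_hw, in_channels, filters, kernel=3):
    """Output size and parameter count of a stride-1, unpadded Conv2D layer."""
    out_hw = input_hw - kernel + 1                               # 'valid' convolution
    params = kernel * kernel * in_channels * filters + filters   # weights + biases
    return out_hw, params

hw, channels = 28, 1   # ASSUMPTION: 28x28 single-channel input
rows = []
for filters in (32, 64, 64):   # filter counts read off the summary
    hw, params = conv2d_summary(hw, channels, filters)
    rows.append((hw, filters, params))
    channels = filters         # the next layer sees this many channels

for hw, filters, params in rows:
    print(f"(None, {hw}, {hw}, {filters})  {params}")
print("Total params:", sum(p for _, _, p in rows))
```

The numbers match the summary (320, 18496, 36928, total 55,744), which is what you would expect if every layer uses a 3*3 kernel.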

With a 3*3 kernel, the first layer creates feature maps from your input. Each pixel of these feature maps depends only on a 3*3 square of the input. Then the second layer does exactly the same thing, taking the feature maps as input. So now one pixel depends on a 3*3 square of a feature map, and each pixel of that square depends on its own 3*3 square of the input. These squares overlap and are shifted by one pixel from each other, so together they cover a 5*5 square of the input.

Doing that a third time, a pixel of a third-layer feature map depends only on a 7*7 window of the input: each additional 3*3 layer widens the window by 2 pixels in each dimension (3, then 5, then 7).
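The same reasoning can be written as a small recurrence. This is only a sketch using the usual receptive-field bookkeeping (window size grows by `(kernel - 1) * jump`, where `jump` is the product of the strides so far); the function name and the `(kernel, stride)` encoding are my own choices, and the 2*2 stride-2 pooling layers in the second call are an assumption about what the max-pooling version would look like.

```python
def receptive_field(layers):
    """Receptive field size (per dimension) after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1            # one input pixel sees itself; step of 1 pixel
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the window
        jump *= stride              # strided layers spread later windows apart
        sizes.append(rf)
    return sizes

conv = (3, 1)   # 3*3 convolution, stride 1
pool = (2, 2)   # ASSUMPTION: 2*2 max pooling with stride 2

print(receptive_field([conv, conv, conv]))              # [3, 5, 7]
print(receptive_field([conv, pool, conv, pool, conv]))  # [3, 4, 8, 10, 18]
```

With three plain convolutions the window stops at 7, whereas inserting pooling between them lets the last layer see a much larger part of the input.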

Here is a 1D example:

        *            # third layer pixel
      | | |
      * * *          # second layer pixels
    | | | | |
    * * * * *        # first layer pixels
  | | | | | | |
  * * * * * * *      # input pixels --> a single third layer pixel depends on only 7 input pixels
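If you want to check the diagram mechanically, here is a minimal sketch that tracks, for each pixel, the set of input indices it depends on. It assumes a 1D input of 7 pixels and a kernel of width 3 with stride 1 and no padding; `next_layer` is a hypothetical helper written for this example, not a Keras function.

```python
def next_layer(deps, kernel=3):
    """deps[i] is the set of input indices that pixel i of this layer depends on.

    Each pixel of the next layer merges the sets of `kernel` adjacent pixels.
    """
    return [set().union(*deps[i:i + kernel])
            for i in range(len(deps) - kernel + 1)]

layer = [{i} for i in range(7)]   # 7 input pixels, each depends only on itself
for name in ("first", "second", "third"):
    layer = next_layer(layer)
    print(f"{name} layer: {len(layer)} pixel(s), "
          f"each depends on {len(layer[0])} input pixels")
```

The loop goes from 5 pixels seeing 3 inputs, to 3 pixels seeing 5, to a single pixel seeing all 7, which is the pyramid drawn above.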
