Steven Dascoli

Reputation: 11

How does a Conv2D layer filter the feature maps output by another Conv2D layer?

I don't understand how the 20 filters of the second Conv2D layer filter the 10 feature maps output by the first Conv2D layer. Does each of the 20 filters filter each of the 10 feature maps (output by the first Conv2D layer)? If so, why isn't the output of the second Conv2D layer 20 × 10 = 200 feature maps? Instead, the output of the second Conv2D layer appears to be 20 feature maps, (None, 8, 8, 20), as shown in the chart below.

This is from the text Generative Deep Learning by David Foster.

from tensorflow.keras import layers, models

input_layer = layers.Input(shape=(32, 32, 3))
# First convolution: 10 filters with a 4x4 kernel
conv_layer_1 = layers.Conv2D(
    filters=10,
    kernel_size=(4, 4),
    strides=2,
    padding='same',
)(input_layer)
# Second convolution: 20 filters with a 3x3 kernel
conv_layer_2 = layers.Conv2D(
    filters=20,
    kernel_size=(3, 3),
    strides=2,
    padding='same',
)(conv_layer_1)
flatten_layer = layers.Flatten()(conv_layer_2)
output_layer = layers.Dense(units=10, activation='softmax')(flatten_layer)
model = models.Model(input_layer, output_layer)

Layer (type)    Output Shape          Param #
InputLayer      (None, 32, 32, 3)     0
Conv2D          (None, 16, 16, 10)    490
Conv2D          (None, 8, 8, 20)      1,820
Flatten         (None, 1280)          0
Dense           (None, 10)            12,810

Upvotes: 0

Views: 104

Answers (2)

simon

Reputation: 5451

The input to the 2nd convlayer has 10 channels. Convolving with a 3×3×10 kernel means producing a weighted sum (where the weights are the values in the kernel) for each input position (i.e. for each possible 3×3 neighborhood) across all 10 channels (I ignore the strides and padding here, since they don't affect the number of output channels). This means, as a consequence, that for each position, each filter's kernel produces exactly 1 output. Since you have 20 filters, the layer concatenates their 20·1 outputs to 20 channels.
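To make the "one output per filter per position" point concrete, here is a minimal NumPy sketch for a single position. The random values and the names `patch` and `kernels` are made up for illustration; they are not taken from the model in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# One 3x3 neighborhood of the 10-channel input (hypothetical random values).
patch = rng.normal(size=(3, 3, 10))
# 20 filters, each holding one 3x3x10 kernel (Keras stores them as 3x3x10x20).
kernels = rng.normal(size=(3, 3, 10, 20))

# A single filter: one weighted sum over all 3*3*10 = 90 values -> one number.
single_output = np.sum(patch * kernels[..., 0])

# All 20 filters at this position: 20 numbers, i.e. 20 output channels.
outputs = np.tensordot(patch, kernels, axes=3)

print(np.ndim(single_output))  # 0
print(outputs.shape)           # (20,)
```

Repeating this for every position of the input gives one 2D map per filter, hence 20 feature maps in total.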

I guess your misconception here is that each filter applies a 3×3 kernel separately to each of the 10 channels, and thus produces 10 weighted sums, corresponding to 10 outputs for each position per filter. That is not the case: again, each filter applies its 3×3×10 kernel across all 10 channels at once to produce 1 output per position. Applying the convolution kernels across all channels to produce one combined result value per position is the general rule for (regular) convlayers. What you (presumably) had in mind would correspond to a grouped convolution, where the number of groups equals the number of input channels (also known as a depthwise convolution).
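The difference between the two readings can be sketched in NumPy for one filter at one position. The per-channel sums below are what a "separate kernel per channel" scheme would output; a regular convolution adds them up into a single value. The array names and random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
patch = rng.normal(size=(3, 3, 10))   # one 3x3 neighborhood, 10 channels
kernel = rng.normal(size=(3, 3, 10))  # the 3x3x10 kernel of ONE filter

# Misconception (depthwise-style): one weighted sum per channel -> 10 values.
per_channel = np.einsum('hwc,hwc->c', patch, kernel)

# Regular convolution: the per-channel sums are added into ONE value.
regular = np.sum(patch * kernel)

print(per_channel.shape)                       # (10,)
print(np.isclose(per_channel.sum(), regular))  # True
```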

You could have already asked the same question for the 1st convlayer, by the way: If there are 3 input channels and 10 filters, why aren't there 3·10=30 channels in its output? The reason is exactly the same: For each position, each of the 10 filters produces exactly 1 output across all 3 channels; so for each position, 10·1 outputs are concatenated to 10 channels.
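The parameter counts in the question's model summary are consistent with this: they only add up if every filter's kernel spans all input channels. A small sketch, where `conv2d_params` is a hypothetical helper (not a Keras function) and one bias per filter is assumed, which is the Keras default:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has kernel_h * kernel_w * in_channels weights plus one bias.
    return (kernel_h * kernel_w * in_channels + 1) * filters

print(conv2d_params(4, 4, 3, 10))   # 490  (first Conv2D layer)
print(conv2d_params(3, 3, 10, 20))  # 1820 (second Conv2D layer)
```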

Upvotes: 1

ThomasIsCoding

Reputation: 102529

The output of the first Conv2D layer has size ?x16x16x10, and it is followed by 20 filters of size 3x3x10.

For each slice along the ? dimension (i.e., each sample in the batch), the 16x16x10 input is convolved with a 3x3x10 filter, and this is done 20 times (since there are 20 different filters). Given the arguments padding = 'same' and strides = 2, each convolved slice has size ceil(16/2)xceil(16/2)=8x8 (you can refer to this article for its computation), and filtering 20 times expands the output to 8x8x20.
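As a quick check of that arithmetic, here is a small sketch that traces the shapes through both layers of the question's model. `same_padding_output` is a hypothetical helper written for this example, not a Keras function.

```python
import math

def same_padding_output(size, stride):
    # With padding='same', the output length is ceil(input length / stride).
    return math.ceil(size / stride)

h = w = 32
for filters in (10, 20):          # the two Conv2D layers, both with strides=2
    h, w = same_padding_output(h, 2), same_padding_output(w, 2)
    print((h, w, filters))        # (16, 16, 10) then (8, 8, 20)

print(h * w * filters)            # 1280, the Flatten size
```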

Upvotes: 0
