Reputation: 11
I don't understand how the 20 filters of the second Conv2D layer filter the 10 feature maps output by the first Conv2D layer. Does each of the 20 filters filter each of the 10 feature maps (output from the first Conv2D layer)? If so, why isn't the output of the second Conv2D layer 20 × 10 = 200 feature maps? Instead, the output of the second Conv2D layer appears to be 20 feature maps, (None, 8, 8, 20), as shown in the chart below.
This is from the text Generative Deep Learning by David Foster.
from tensorflow.keras import layers, models

input_layer = layers.Input(shape=(32,32,3))
conv_layer_1 = layers.Conv2D(
    filters = 10
    , kernel_size = (4,4)
    , strides = 2
    , padding = 'same'
)(input_layer)
conv_layer_2 = layers.Conv2D(
    filters = 20
    , kernel_size = (3,3)
    , strides = 2
    , padding = 'same'
)(conv_layer_1)
flatten_layer = layers.Flatten()(conv_layer_2)
output_layer = layers.Dense(units=10, activation = 'softmax')(flatten_layer)
model = models.Model(input_layer, output_layer)
Layer (type)   Output Shape         Param #
InputLayer     (None, 32, 32, 3)    0
Conv2D         (None, 16, 16, 10)   490
Conv2D         (None, 8, 8, 20)     1,820
Flatten        (None, 1280)         0
Dense          (None, 10)           12,810
Upvotes: 0
Views: 104
Reputation: 5451
The input to the 2nd conv layer has 10 channels. Convolving with a 3×3×10 kernel means producing a weighted sum (where the weights are the values in the kernel) for each input position (i.e. for each possible 3×3 neighborhood) across all 10 channels (I ignore the strides and padding here, since they don't affect the number of output channels). As a consequence, for each position, each filter's kernel produces exactly 1 output. Since you have 20 filters, the layer concatenates their 20·1 outputs into 20 channels.
I guess your misconception is that each filter applies a 3×3 kernel separately to each of the 10 channels, and thus produces 10 weighted sums, corresponding to 10 outputs per position per filter. That is not the case: again, each filter applies its 3×3×10 kernel across all 10 channels at once to produce 1 output per position. Applying the convolution kernels across all channels to produce one combined result value per position is the general rule for (regular) conv layers. What you (presumably) had in mind would correspond to a grouped convolution, where the number of groups equals the number of channels.
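To make this concrete, here is a small NumPy sketch (the names are illustrative, not from the book) of what one filter does at one spatial position: its 3×3×10 kernel is multiplied element-wise with the 3×3×10 input neighborhood and everything is summed into a single scalar, so 20 filters yield 20 values per position.

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 3, 10))   # one 3x3 neighborhood, all 10 channels
kernel = rng.standard_normal((3, 3, 10))  # one filter's full kernel

# The weighted sum runs over ALL 3*3*10 = 90 values at once -> one scalar,
# not 10 per-channel results.
single_output = np.sum(patch * kernel)
print(single_output.shape)  # () -- a scalar

# With 20 such filters, each position gets 20 scalars -> 20 output channels.
kernels = rng.standard_normal((20, 3, 3, 10))
outputs = np.array([np.sum(patch * k) for k in kernels])
print(outputs.shape)  # (20,)
```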
You could have already asked the same question for the 1st conv layer, by the way: if there are 3 input channels and 10 filters, why aren't there 3·10 = 30 channels in its output? The reason is exactly the same: for each position, each of the 10 filters produces exactly 1 output across all 3 channels, so for each position, 10·1 outputs are concatenated into 10 channels.
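You can also verify this against the parameter counts in your model summary: because each filter's kernel spans all input channels (plus one bias per filter), the counts come out to 490 and 1,820. A quick sketch (the helper name is mine, not a Keras API):

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has kernel_h * kernel_w * in_channels weights plus 1 bias.
    return (kernel_h * kernel_w * in_channels + 1) * filters

print(conv2d_params(4, 4, 3, 10))   # 1st layer: (4*4*3 + 1) * 10 = 490
print(conv2d_params(3, 3, 10, 20))  # 2nd layer: (3*3*10 + 1) * 20 = 1820
```

If each filter instead had a separate 3×3 kernel per channel producing separate outputs, these numbers would not match the summary.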
Upvotes: 1
Reputation: 102529
The output from the first Conv2D is of size ?×16×16×10, and it is followed by 20 filters of size 3×3×10.

For each slice along the ? dimension (i.e. the number of samples), the 16×16×10 tensor is convolved with a 3×3×10 filter, and this is done 20 times (since there are 20 different filters). Given the arguments padding = 'same' and strides = 2, the spatial size of each convolved slice is ceil(16/2) × ceil(16/2) = 8×8 (you can refer to this article for its computation), while the 20-fold filtering expands the channel dimension, giving 8×8×20.
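The shape bookkeeping for the whole model can be sketched as follows, assuming Keras' rule for padding='same', where the output spatial size is ceil(in / stride), independent of kernel size:

```python
import math

def same_padding_out(size, stride):
    # Keras 'same' padding: output spatial size = ceil(input / stride)
    return math.ceil(size / stride)

h = w = 32
h, w = same_padding_out(h, 2), same_padding_out(w, 2)  # after 1st Conv2D
print((h, w, 10))   # (16, 16, 10)
h, w = same_padding_out(h, 2), same_padding_out(w, 2)  # after 2nd Conv2D
print((h, w, 20))   # (8, 8, 20)
print(h * w * 20)   # 1280 -- the Flatten output size
```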
Upvotes: 0