How does a convolutional kernel transform images with 3 channels into multiple channels? What does the last argument mean?

Question

I trained the ResNet50V2 model and I was wondering how the tensors transform from 3 channels to n channels. I have the model as:

model.summary()

Model: "model_9"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_9 (InputLayer)            (None, 164, 164, 3)  0                                            
__________________________________________________________________________________________________
conv1_pad (ZeroPadding2D)       (None, 170, 170, 3)  0           input_9[0][0]                    
__________________________________________________________________________________________________
conv1_conv (Conv2D)             (None, 82, 82, 64)   9472        conv1_pad[0][0]                  
__________________________________________________________________________________________________
pool1_pad (ZeroPadding2D)       (None, 84, 84, 64)   0           conv1_conv[0][0]                 
__________________________________________________________________________________________________
...
...
...
...
...
...
post_relu (Activation)          (None, 6, 6, 2048)   0           post_bn[0][0]                    
__________________________________________________________________________________________________
flatten_9 (Flatten)             (None, 73728)        0           post_relu[0][0]                  
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, 37)           2727973     flatten_9[0][0]                  
==================================================================================================
Total params: 26,292,773
Trainable params: 26,247,333
Non-trainable params: 45,440

The first convolution layer "conv1_conv" has a filter:

layer = model.get_layer("conv1_conv")
filters = layer.get_weights()[0]  # index 0 is the kernel, index 1 is the bias
print(layer.name, filters.shape)

Output:

conv1_conv (7, 7, 3, 64)

What I don't understand is how the convolution operation converts the (170, 170, 3) tensor into an (82, 82, 64) tensor.

What does the 64 in the conv1_conv filter shape indicate?

Jindřich · Accepted Answer

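You can imagine the convolution as a 7 × 7 window sliding over the image. Each filter takes one window of the image, here 7 × 7 × 3 numbers, and makes a linear projection of it into a single number. Each filter needs 7*7*3 = 147 weights for that projection, and there are 64 filters, hence the kernel shape 7 × 7 × 3 × 64. The 64 is therefore the number of filters, which is also the number of output channels. With one bias per filter, that gives 147*64 + 64 = 9472 parameters, matching the Param # column in the summary.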
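To make the "linear projection" concrete, here is a minimal NumPy sketch of what happens to a single window. The random `window`, `kernel`, and `bias` arrays are hypothetical stand-ins with the same shapes as in conv1_conv, not the trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: one 7x7 window of a 3-channel image, plus a
# kernel and bias shaped like those of conv1_conv.
window = rng.normal(size=(7, 7, 3))
kernel = rng.normal(size=(7, 7, 3, 64))
bias = np.zeros(64)

# Each filter multiplies the window elementwise and sums all 7*7*3 products,
# so one window yields one number per filter: 64 numbers in total.
out = np.tensordot(window, kernel, axes=([0, 1, 2], [0, 1, 2])) + bias

# Equivalent view: flatten the window to 147 numbers and apply a 147x64 matrix.
out_flat = window.reshape(-1) @ kernel.reshape(-1, 64) + bias

n_params = kernel.size + bias.size
print(out.shape, n_params)
```

The flattened form shows that, per window, the layer behaves like a small dense layer from 147 inputs to 64 outputs.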

The other important property of the convolution is the stride: the step by which the window moves. The window has size 7 and the padded image has width and height 170, so the window can shift by at most 170-7=163 pixels from its starting position. With stride 2, that is floor(163/2)=81 shifts, which together with the starting position gives 82 window positions along each axis. Each window gets projected by the 64 filters, hence the output shape 82 × 82 × 64.
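The output-size arithmetic can be checked with a naive strided convolution. This is only a sketch: `conv_output_size` and `naive_conv2d` are hypothetical helper names, the inputs are random arrays rather than real images or trained weights, and the bias is omitted. Keras uses a much faster implementation, but the resulting shape should be the same.

```python
import numpy as np

def conv_output_size(input_size, kernel_size, stride):
    # Number of window positions: the starting one plus one per full stride step.
    return (input_size - kernel_size) // stride + 1

def naive_conv2d(image, kernel, stride):
    kh, kw, _, n_filters = kernel.shape
    out_h = conv_output_size(image.shape[0], kh, stride)
    out_w = conv_output_size(image.shape[1], kw, stride)
    out = np.empty((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            # Cut out the current window and project it with all filters at once.
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw, :]
            out[i, j] = np.tensordot(window, kernel, axes=3)
    return out

rng = np.random.default_rng(0)
padded = rng.normal(size=(170, 170, 3))   # stand-in for the conv1_pad output
kernel = rng.normal(size=(7, 7, 3, 64))   # same shape as the conv1_conv kernel
result = naive_conv2d(padded, kernel, stride=2)
print(conv_output_size(170, 7, 2), result.shape)
```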
