Reputation: 155
I'm trying to understand the transformation performed by tf.layers.conv2d.
The mnist tutorial code from the TensorFlow website includes the convolution layer:
# Computes 64 features using a 5x5 filter.
# Padding is added to preserve width and height.
# Input Tensor Shape: [batch_size, 14, 14, 32]
# Output Tensor Shape: [batch_size, 14, 14, 64]
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)
However, my expectation was that the 32 input channels would be multiplied by the number of filters, since each filter is applied to each channel, giving an output tensor of [batch_size, 14, 14, 2048]. Clearly this is wrong, but I don't know why. How does the transformation work? The API documentation tells me nothing about how it works. What would the output be if the input tensor were [batch_size, 14, 14, 48]?
Upvotes: 2
Views: 6995
Reputation: 5722
I think you might have a minor misunderstanding of how the filters work here. This introduction and this answer provide some detailed explanation. I found the Convolution Demo animation in the introduction extremely helpful in showing how it works.
The key point is how each filter works. A convolutional layer usually has a set of K filters (64 in your example). Each filter's actual shape is kernel_size x depth_of_input (5x5x32 in your example). That means one filter looks at all 32 channels/images at once and produces one computed feature. Therefore, the depth (number of features) of the output equals your filters argument, not input_depth * filters. Please check this code to get an idea of the real, final kernel used in the computation.
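A minimal NumPy sketch of this (all shapes assumed from the example above, not taken from TensorFlow itself): one filter spans the full input depth and collapses one 5x5x32 receptive field to a single number, so 64 filters give 64 output channels at each spatial position.

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((5, 5, 32))        # one receptive field across ALL 32 input channels
kernels = rng.standard_normal((5, 5, 32, 64))  # 64 filters, each of shape 5x5x32

# One filter collapses the entire 5x5x32 patch to a single scalar feature...
one_feature = np.sum(patch * kernels[..., 0])
print(one_feature.shape)  # ()

# ...so applying all 64 filters yields 64 output channels at this position.
features = np.einsum('hwc,hwck->k', patch, kernels)
print(features.shape)  # (64,)
```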
Therefore, to answer your last question: the output for an input of either [batch_size, 14, 14, 32] or [batch_size, 14, 14, 48] will always be [batch_size, 14, 14, 64] with your settings.
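To make that concrete, here is a naive "same"-padding, stride-1 convolution sketched in NumPy (NHWC layout assumed, matching tf.layers.conv2d's default; conv2d_same is a hypothetical helper, not a TensorFlow function). Only the kernel's depth changes with the input depth; the output shape does not.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive 'same'-padding, stride-1 2-D convolution (NHWC layout)."""
    n, h, w, c = x.shape
    kh, kw, _, k = kernels.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    out = np.empty((n, h, w, k))
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + kh, j:j + kw, :]       # full-depth receptive field
            out[:, i, j, :] = np.einsum('nhwc,hwck->nk', patch, kernels)
    return out

rng = np.random.default_rng(0)
for in_depth in (32, 48):
    x = rng.standard_normal((1, 14, 14, in_depth))
    kernels = rng.standard_normal((5, 5, in_depth, 64))
    print(in_depth, conv2d_same(x, kernels).shape)  # (1, 14, 14, 64) both times
```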
Upvotes: 1
Reputation: 61
The output size depends on the input dimensions, the filter width, padding, and stride. You can evaluate conv2 (or any individual layer, for that matter) and then print the dimensions of its output to confirm they are what you expect. You aren't required to call eval only on the final layer; TensorFlow is more flexible than that.
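The shape rule described above can be sketched as a small helper (an illustration, not a TensorFlow API; NHWC layout and the standard "same"/"valid" formulas are assumed):

```python
def conv2d_output_shape(input_shape, filters, kernel_size, padding="same", stride=1):
    """Compute the output and kernel shapes for a 2-D convolution (NHWC)."""
    batch, h, w, in_depth = input_shape
    kh, kw = kernel_size
    if padding == "same":
        out_h, out_w = -(-h // stride), -(-w // stride)  # ceil division
    else:  # "valid"
        out_h = (h - kh) // stride + 1
        out_w = (w - kw) // stride + 1
    # Input depth only shapes the kernel; output depth is always `filters`.
    kernel_shape = (kh, kw, in_depth, filters)
    return (batch, out_h, out_w, filters), kernel_shape

print(conv2d_output_shape((1, 14, 14, 32), 64, (5, 5)))  # ((1, 14, 14, 64), (5, 5, 32, 64))
print(conv2d_output_shape((1, 14, 14, 48), 64, (5, 5)))  # ((1, 14, 14, 64), (5, 5, 48, 64))
```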
Upvotes: 1