Reputation: 186
The function that is currently used widely in tutorials and elsewhere is of the form:
conv_out = conv2d(
    input=x,        # some 4D tensor
    filters=w,      # some shared variable
    filter_shape=[nkerns, stack_size, filter_height, filter_width],
    image_shape=[batch_size, stack_size, height, width]
)
For the first layer of a CNN, I have filter_shape as [20, 1, 7, 7], which means 20 kernels, each 7 x 7 — but what does the '1' stand for? My image_shape is [100, 1, 84, 84].
This convolution outputs a tensor of shape [100, 20, 26, 26], which I understand. My next layer takes filter_shape = [50, 20, 5, 5] and image_shape = [100, 20, 26, 26], and produces an output of shape [100, 50, 11, 11]. I mostly understand this operation, except: if I have a layer of 50 filters, each working on the 20 feature maps produced by the previous layer, shouldn't I get 1000 feature maps in all instead of just 50? To restate my question: with a stack of 20 feature maps, each convolved with 50 kernels, shouldn't my output shape be [100, 1000, 11, 11] instead of [100, 50, 11, 11]?
Upvotes: 2
Views: 1518
Reputation: 14377
To answer your questions:
The 1 stands for the number of input channels. Since you seem to be using grayscale images, it is 1; for color images it would be 3. For deeper convolutional layers, as in your second question, it must equal the number of feature maps the previous layer produced.
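To make the channel bookkeeping concrete, here is a sketch of the shapes from your question. (The 26x26 and 11x11 sizes suggest a pooling step after each convolution, since a 'valid' convolution alone would give 78x78 and 22x22 — that part is my assumption.)

```python
# Layer 1: grayscale input, so the channel dimension is 1 in both shapes.
image_shape_1 = (100, 1, 84, 84)    # (batch, channels, height, width)
filter_shape_1 = (20, 1, 7, 7)      # (n_kernels, channels, f_height, f_width)

# Layer 2: the channel dimension must match the 20 maps layer 1 produced.
image_shape_2 = (100, 20, 26, 26)
filter_shape_2 = (50, 20, 5, 5)

def valid_conv_shape(image_shape, filter_shape):
    """Output shape of a 'valid' convolution, before any pooling."""
    b, c, h, w = image_shape
    n, fc, fh, fw = filter_shape
    assert fc == c, "filter channels must equal input channels"
    return (b, n, h - fh + 1, w - fw + 1)

print(valid_conv_shape(image_shape_1, filter_shape_1))  # (100, 20, 78, 78)
print(valid_conv_shape(image_shape_2, filter_shape_2))  # (100, 50, 22, 22)
```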
Using a filter bank of shape [50, 20, 5, 5] on an input of shape [100, 20, 26, 26] is actually a good example for your first question as well. You have 50 filters, each of shape [20, 5, 5], and every image is of shape [20, 26, 26]. Each convolution uses all 20 channels at once: filter channel 0 is applied to image channel 0, filter channel 1 to image channel 1, and so on, and the per-channel results are summed into a single output map. That is why you get 50 output maps rather than 20 x 50 = 1000. Does that make sense?
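A minimal NumPy sketch of this sum-over-channels behaviour (shapes from your question, batch size shrunk to 4 to keep it quick; this uses plain correlation rather than Theano's flipped convolution, which does not affect the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 20, 26, 26))       # (batch, channels, h, w)
filters = rng.standard_normal((50, 20, 5, 5))  # (n_filters, channels, fh, fw)

out_h, out_w = 26 - 5 + 1, 26 - 5 + 1          # 'valid' output size: 22 x 22
out = np.empty((4, 50, out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = x[:, :, i:i + 5, j:j + 5]      # (batch, 20, 5, 5)
        # Sum over all 20 channels AND the 5x5 window for each filter:
        out[:, :, i, j] = np.einsum('bchw,fchw->bf', patch, filters)

# 50 output maps, not 20 * 50 = 1000, because each filter's per-channel
# responses are summed into a single map.
print(out.shape)  # (4, 50, 22, 22)
```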
Upvotes: 4