Reputation: 1692
I came across this sentence describing the difference between CNNs and DNNs:
Convolutions also have a major advantage for image processing, where pixels nearby each other are much more correlated to each other for image detection.
and
Each CNN layer looks at an increasingly larger part of the image.
However, I could not understand the idea behind this. CNNs use fewer connections to the next layer, but how does narrowing each node's scope in each layer increase the scope of the later layers? Have I misunderstood something, or is there really an idea behind this?
Thanks.
Upvotes: 1
Views: 51
Reputation: 10150
Those two statements are somewhat unrelated as far as your question goes, and it sounds like it's the second statement that's causing the confusion:
Each CNN layer looks at an increasingly larger part of the image.
I believe this is referring to the fact that the input volume to a convolution/pooling layer is (usually/sometimes) larger than the output volume. How does this happen? See the following image from here:
Each "neuron" in the output volume is calculated using a convolve operation on the input. In the example image they have a 3x3x3 filter that operates over a 3x3x3 window of the input at a time. This results in a single value in the output. With the particular stride size they've chosen for their example, this process ends up creating an output volume that is smaller than the input (7x7x3 vs. 3x3x2). Note that a convolution operation doesnt always create a smaller output (depends on padding, stride size etc). However, a pooling layer will reduce the size of the output relative to input.
This reduced output volume is passed on to the next convolution layer (C2). Because the input to C2 is smaller than the original image, each filter in C2 is effectively looking at a larger fraction of the original image at once.
It's almost like having a fixed-size window while the image gets smaller each time you look through it: it would seem like you're looking at more and more of the image each time.
I should note that the statement holds even if the output size never decreases. Imagine a stack of convolution layers whose output volumes all stay the same size (with padding, say). Each individual cell in an output volume encodes information from multiple surrounding input cells. When you pass this output to C2, the C2 kernels look at a window of that input, but each cell in that window already contains information aggregated from its own surrounding pixels in the previous layer. So as you go through the layers, the filters take in more and more surrounding information each time.
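To make that concrete, here's a small sketch (my own illustration, not from the original post) that tracks the receptive field of one output cell through a stack of conv layers, using the usual recurrence rf += (kernel - 1) * jump, where jump is the spacing, in input pixels, between adjacent cells at the current layer:

```python
# Receptive field of a single output cell after stacking conv layers.
# 'jump' is the distance (in original input pixels) between adjacent
# cells at the current layer; it is multiplied by each layer's stride.
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three stacked 3x3 convs, stride 1 ("same" padding keeps the spatial
# size constant, but the receptive field still grows):
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# With stride 2 in each layer it grows much faster:
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```

Even when padding keeps every layer at the original spatial size, three stacked 3x3 convs already give each cell a 7x7 view of the input, which is exactly the "increasingly larger part of the image" the quoted statement describes.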
Upvotes: 2