Reputation: 3171
Generally we insert max-pooling layers between convolutional layers. The main idea is to "summarize" the features produced by the conv. layers. But it's hard to decide when to insert them. I have some questions about this:
How do we decide how many conv. layers to stack before inserting a max-pooling layer? And what is the effect of too many / too few conv. layers?
Max-pooling reduces the spatial size, so if we want a very deep network we cannot apply too many max-pooling layers, otherwise the feature maps become too small. For example, MNIST has only a 28x28 input, yet I do see people experimenting with very deep networks on it, so they may end up with very small feature maps. When the size becomes too small (in the extreme case, 1x1), the layer is effectively a fully-connected layer, and doing convolution on it doesn't seem to make any sense.
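To make the second question concrete, here is the shrinking I mean (a minimal sketch, assuming 2x2 max-pooling with stride 2 and no padding):

```python
# How a 28x28 input (e.g. MNIST) shrinks under repeated 2x2/stride-2 max-pooling:
# each pooling layer roughly halves the spatial size (floor division).
size = 28
for i in range(1, 5):
    size = size // 2
    print(f"after pooling #{i}: {size}x{size}")
# after pooling #1: 14x14
# after pooling #2: 7x7
# after pooling #3: 3x3
# after pooling #4: 1x1  -> effectively a fully-connected layer
```

So after only four poolings there is nothing left to convolve over.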
I know there is no golden rule, but I just want to figure out the basic intuition behind this, so that I can make a reasonable choice when implementing a network.
Upvotes: 5
Views: 5682
Reputation: 53778
You are right, there's no one best way to do it, just like there's no one best filter size or one best neural network architecture in general.
VGG-16 uses 2-3 convolutional layers between the pooling layers (the picture below), VGG-19 uses up to 4 layers, ...
... and GoogLeNet applies an incredible number of convolutions (the picture below), in between and sometimes in parallel with max-pooling layers.
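To make the pattern concrete, here is a minimal sketch of a VGG-style stack in PyTorch (the channel counts loosely follow the first two VGG-16 blocks; the exact numbers are not the point, only the conv-conv-pool rhythm is):

```python
import torch.nn as nn

# VGG-style pattern: a few 3x3 convolutions with padding (spatial size preserved),
# then one 2x2 max-pooling that halves the spatial size.
vgg_style = nn.Sequential(
    # block 1: two convolutions, then pool
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    # block 2: two more convolutions, then pool
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```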
Each new layer obviously increases the network's flexibility, so that it can approximate more complex target functions. On the other hand, it requires more computation for training; however, it's common to save computation using the 1x1 convolution trick. How much flexibility does your network need? That greatly depends on the data, but usually 2-3 layers between poolings are flexible enough for most applications, and additional layers hardly affect the performance. There's no better strategy than to cross-validate models of various depths. (The pictures are from this blog post.)
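Here is a minimal sketch of that 1x1 trick (the channel sizes are just an example): a 1x1 convolution shrinks the number of channels before the expensive 3x3 convolution and expands them again afterwards, essentially the bottleneck used inside Inception modules:

```python
import torch.nn as nn

# Naive: 3x3 convolution directly on 256 channels.
naive = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# Bottleneck: 1x1 down to 64 channels, 3x3 on 64, 1x1 back up to 256.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 256, kernel_size=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(naive), count(bottleneck))  # ~590k vs. ~70k parameters
```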
This is a known issue, and I'd like to mention one particular technique that deals with too-aggressive downsampling: Fractional Max-Pooling. The idea is to apply receptive fields of different sizes to different neurons in the layer, so that the image can be reduced by any ratio: 90%, 75%, 66%, etc.
It is one of the ways to build deeper networks, particularly for small images like MNIST digits, and it has demonstrated very good accuracy (0.32% test error).
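If you want to try it, PyTorch ships an implementation of the idea; a minimal sketch, assuming a 2x2 kernel and a 75% output ratio:

```python
import torch
import torch.nn as nn

# Fractional max-pooling: instead of always halving the spatial size,
# shrink it by a gentler ratio, e.g. to ~75% of the input.
pool = nn.FractionalMaxPool2d(kernel_size=2, output_ratio=0.75)

x = torch.randn(1, 16, 28, 28)   # a batch with 16 feature maps of size 28x28
print(pool(x).shape)             # torch.Size([1, 16, 21, 21]) -> 75% of 28
```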
Upvotes: 6