Jayhello
Jayhello

Reputation: 6602

In deep learning what's difference between two 3*3 convolution filters and one 5*5 convolution filters?

For example, as to the famous AlexNet architecutre (original paper), what's the difference of using two 3*3 convolution filters between using one 5*5 convolution filter ?

The two 3*3 convolution filters and one 5*5 convolution filter have been highlighted by red rectangle in the below image.

https://medium.com/@smallfishbigsea/a-walk-through-of-alexnet-6cbd137a5637

enter image description here

What about use another 5*5 convolution filter to supersede the two 3*3 convolution filters, or vice verse?

Upvotes: 1

Views: 3723

Answers (2)

Khalid Saifullah
Khalid Saifullah

Reputation: 795

If you still have some doubt, hope this one helps.

If you stack two 3x3 conv layers, it eventually gets a receptive field of 5 (same as one 5x5 conv layer) with respect to the input. However, the advantage of using a smaller conv layer like 3x3 is it needs less parameter (you can do the parameter calculation of two 3x3 layers and one 5x5 layer --> like 2*(33) = 18 and 1(5*5) = 25 assuming 1 channel). Also, two conv layer gets more non-linearity in between than one 5x5 layer, so it has got more discriminative power. For the receptive field part, I hope this paper of mine helps you to visualize (BTW it's the answer sheet from my exam): Receptive Field Visualization

Upvotes: 2

Jayhello
Jayhello

Reputation: 6602

I have found from paper <<Very Deep Convolutional Networks for Large-Scale Image Recognition>>.

Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv.layers (without spatial poolingin between) has an effective receptive field of 5×5; three such layers have a 7 × 7 effective receptive field.

  1. two 3*3 convolution filter is equivalent to one 5*5 convolution filter.

  2. two 3*3 convolution filter will have less parameters than one 5*5 convolution filter.

  3. two 3*3 convolution filter will make network more deep and extract more complex features than one 5*5 convolution filter.

paper:https://arxiv.org/pdf/1409.1556.pdf

Upvotes: 2

Related Questions