seermer

Reputation: 651

Why do CNNs usually have a stem?

Most cutting-edge/famous CNN architectures have a stem that does not use the same block as the rest of the network. Instead, most architectures use plain Conv2d or pooling in the stem, without special modules/layers such as a shortcut (residual), an inverted residual, a ghost conv, and so on.
Why is this? Are there experiments/theories/papers/intuitions behind this?

examples of stems:
classic ResNet: Conv2d+MaxPool:
[figure 1: ResNet configuration]

bag-of-tricks ResNet-C: 3×Conv2d + MaxPool.
Even though two of these Conv2d layers could form exactly the same structure as a classic residual block, as shown below in [figure 2], there is no shortcut in the stem:
[figure 2: ResNet-C stem and classic residual block]
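For reference, here is a minimal PyTorch sketch of the two stems described above (layer sizes assumed to follow the two papers cited below); note that neither of them contains a shortcut:

    import torch.nn as nn

    # classic ResNet stem: one 7x7 stride-2 conv followed by a 3x3 stride-2 max-pool
    resnet_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

    # ResNet-C stem: the 7x7 conv is replaced by three 3x3 convs (32, 32, 64 channels)
    resnet_c_stem = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(32),
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(32),
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )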

There are many other architectures with similar stems, such as EfficientNet, MobileNet, GhostNet, SE-Net, and so on.

cite:
Bag of Tricks for Image Classification with Convolutional Neural Networks: https://arxiv.org/abs/1812.01187
Deep Residual Learning for Image Recognition (ResNet): https://arxiv.org/abs/1512.03385

Upvotes: 4

Views: 4588

Answers (3)

Andrea Gurioli

Reputation: 23

Stem layers act as a compression mechanism on the input image. This leads to a fast reduction in the spatial size of the activations, which reduces memory and computational costs.

Upvotes: 1

Serhii Maksymenko

Reputation: 319

As far as I know, this is done in order to quickly downsample the input image with strided convolutions of fairly large kernel size (5x5 or 7x7), so that further layers can do their work effectively at much lower computational cost.
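To illustrate, here is a small sketch (shapes assume a standard 224x224 ImageNet-style input) of how much such a stem shrinks the activations before the main blocks ever run:

    import torch
    import torch.nn as nn

    stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # 224 -> 112
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 112 -> 56
    )

    x = torch.randn(1, 3, 224, 224)
    print(stem(x).shape)  # torch.Size([1, 64, 56, 56]), i.e. 16x fewer spatial positions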

Upvotes: 6

Gilles Ottervanger

Reputation: 671

This is because these specialized modules can do no more than plain convolutions can; the difference is in the trainability of the resulting architecture. For example, the skip connections in ResNet are meant to bypass some layers while these are still so badly trained that they do not propagate useful information from input to output. However, when fully trained, the skip connections could in theory be removed (or integrated) entirely, since the information can then also propagate through the layers that would otherwise be skipped.

On the other hand, when you are using a backbone that you don't intend to train yourself, it does not make sense to include architectural features that are aimed at trainability. Instead, you can "compress" the backbone, leaving only relatively fundamental operations, and freeze all weights. This saves computational cost both when training the head and in the final deployment.
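To make the "removed (or integrated)" point concrete, here is a toy sketch (my own illustration, not from the cited papers) that folds an identity shortcut into a plain 3x3 convolution. This only works when input and output channels match, the stride is 1, and there is no nonlinearity between the convolution and the addition:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False)
    fused = nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False)

    with torch.no_grad():
        w = conv.weight.clone()      # shape: [out=8, in=8, 3, 3]
        for c in range(8):
            w[c, c, 1, 1] += 1.0     # add the identity to each filter's centre tap
        fused.weight.copy_(w)

    x = torch.randn(1, 8, 16, 16)
    print(torch.allclose(conv(x) + x, fused(x), atol=1e-5))  # True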

Upvotes: 1
