Jame

Reputation: 3854

Conv 1x1 configuration for feature reduction

I am using a 1x1 convolution in a deep network to reduce a feature map x of shape Bx2CxHxW to BxCxHxW. I have three options (sketched in code below the list):

  1. x -> Conv (1x1) -> BN -> ReLU. The code will be output = ReLU(BN(Conv(x))). Reference: ResNet
  2. x -> BN -> ReLU -> Conv. The code will be output = Conv(ReLU(BN(x))). Reference: DenseNet
  3. x -> Conv. The code will be output = Conv(x)
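
For concreteness, here is a minimal PyTorch-style sketch of the three options (the channel count and shapes are just placeholders I picked for the example):

    import torch
    import torch.nn as nn

    C = 64  # example channel count

    # Option 1: Conv -> BN -> ReLU (ResNet style)
    reduce1 = nn.Sequential(
        nn.Conv2d(2 * C, C, kernel_size=1, bias=False),
        nn.BatchNorm2d(C),
        nn.ReLU(inplace=True),
    )

    # Option 2: BN -> ReLU -> Conv (DenseNet / pre-activation style)
    reduce2 = nn.Sequential(
        nn.BatchNorm2d(2 * C),
        nn.ReLU(inplace=True),
        nn.Conv2d(2 * C, C, kernel_size=1, bias=False),
    )

    # Option 3: plain Conv
    reduce3 = nn.Conv2d(2 * C, C, kernel_size=1)

    x = torch.randn(8, 2 * C, 32, 32)  # B x 2C x H x W
    # all three produce B x C x H x W
    print(reduce1(x).shape, reduce2(x).shape, reduce3(x).shape)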

Which one is most commonly used for feature reduction? Why?

Upvotes: 2

Views: 521

Answers (1)

Shai

Reputation: 114896

Since you are going to train your net end-to-end, whichever configuration you use, the weights will be trained to accommodate it.

BatchNorm?
I guess the first question you need to ask yourself is whether you want to use BatchNorm at all. If your net is deep and you are concerned with covariate shifts, then you should probably have BatchNorm, and there goes option no. 3.

BatchNorm first?
If your x is the output of another conv layer, then there's actually no difference between your first and second alternatives: your net is a cascade of ...-conv-BN-ReLU-conv-BN-ReLU-conv-..., so the triplets of functions conv, BN, ReLU are only an "artificial" partitioning of the net, and up to the very first and last functions you can split things however you wish. Moreover, since BatchNorm is a linear operation at inference time (a per-channel scale + bias), it can be "folded" into an adjacent conv layer without changing the net, so you are basically left with conv-ReLU pairs.
So there's not really a big difference between the first two options you highlighted.
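
To make the folding argument concrete, here is a rough PyTorch sketch (the helper name fold_bn_into_conv is mine, and the equivalence holds at inference time, when BatchNorm uses its running statistics):

    import torch
    import torch.nn as nn

    def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
        # Build a single conv equivalent to bn(conv(x)) in eval mode.
        fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                          conv.kernel_size, conv.stride, conv.padding, bias=True)
        with torch.no_grad():
            scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel gamma / std
            fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
            conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
            fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
        return fused

    # sanity check: bn(conv(x)) == fused(x) up to numerical noise
    conv, bn = nn.Conv2d(128, 64, kernel_size=1, bias=False), nn.BatchNorm2d(64)
    bn.eval()
    x = torch.randn(2, 128, 8, 8)
    print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))  # True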

What else to consider?
Do you really need a ReLU when changing the dimension of the features? You can think of the dimensionality reduction as a linear mapping: decomposing the weight matrix applied to x into a lower-rank matrix that ultimately maps into C-dimensional space instead of 2C-dimensional space. If a linear mapping is enough for you, then you can omit the ReLU altogether.
See the SVD trick used in Fast R-CNN for an example.
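
As a rough illustration of that idea (an illustrative helper using nn.Linear for clarity; a 1x1 conv is the same linear map applied at every spatial location):

    import torch
    import torch.nn as nn

    def low_rank_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
        # Replace one linear layer with two thinner ones via truncated SVD,
        # in the spirit of the Fast R-CNN fc-layer speedup. No ReLU in between:
        # the composition stays a purely linear mapping.
        U, S, Vh = torch.linalg.svd(linear.weight, full_matrices=False)
        first = nn.Linear(linear.in_features, rank, bias=False)
        second = nn.Linear(rank, linear.out_features, bias=True)
        with torch.no_grad():
            first.weight.copy_(torch.diag(S[:rank]) @ Vh[:rank])  # rank x in_features
            second.weight.copy_(U[:, :rank])                      # out_features x rank
            if linear.bias is not None:
                second.bias.copy_(linear.bias)
            else:
                second.bias.zero_()
        return nn.Sequential(first, second)

    # a 2C -> C reduction treated as a pure linear map, approximated at rank 32
    layer = nn.Linear(128, 64)
    print(low_rank_linear(layer, rank=32)(torch.randn(4, 128)).shape)  # torch.Size([4, 64])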

Upvotes: 3
