Kuba

Reputation: 291

Why batch normalization over channels only in CNN

I am wondering whether, in convolutional neural networks, batch normalization should be applied with respect to every pixel separately, or whether I should take the mean over pixels with respect to each channel?

I saw in the description of TensorFlow's tf.layers.batch_normalization that it is suggested to perform BN with respect to the channels, but, if I recall correctly, I have used the other approach with good results.

Upvotes: 13

Views: 15838

Answers (4)

Rui

Reputation: 1

The other answers have clarified that the convention is indeed for BatchNorm2D to normalize per channel across all other dimensions. Here I want to present my explanation and visualization for why that is so.

TL;DR: Take an input tensor X of shape (B, F, S) = (batch size, number of features, feature size), where S may stand for zero or more dimensions. BatchNorm k-D (k a natural number) independently normalizes each feature along the F axis, computing the mean and std across the B and S axes.

  • By its name it's apparent that we normalize across the B dimension.
  • We also normalize across the S dimension(s) because we treat each feature as a whole.
  • We don't normalize across the F dimension because different features are supposed to have different means and stds.

Examples using common notations:

  • 0-D Inputs: X of shape (B, F, 1) such as tabular data/outputs of fully-connected layers. Each feature is a single number, thus feature size S=(1).
  • 1-D Inputs: X of shape (B, F=E, S=T) such as time-series data/outputs of Conv1D/RNN/Transformer. E is the latent/embedding dimension, T is time-steps. Each feature is a 1-D vector, thus S=(T).
  • 2-D Inputs: X of shape (B, F=C, S=(W, H)) such as images. Here S represents two dimensions, (W, H), the width and height of the image. Each feature is a 2-D matrix, thus S=(W, H).

Here's a visualization for the 0-D to 2-D cases: BatchNorm k-D Visualized
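To make the reduction axes concrete, here is a minimal NumPy sketch (the function name batch_norm and the feature_axis parameter are mine, not from any library) that normalizes each feature using statistics computed over every other axis:

    import numpy as np

    def batch_norm(x, feature_axis=1, eps=1e-5):
        """Normalize each feature (the F axis) with statistics computed
        over every other axis (B and all S dimensions)."""
        reduce_axes = tuple(i for i in range(x.ndim) if i != feature_axis)
        mu = x.mean(axis=reduce_axes, keepdims=True)
        var = x.var(axis=reduce_axes, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    # 0-D inputs (B, F): tabular data -> stats over axis B, one pair per feature
    print(batch_norm(np.random.randn(32, 10)).shape)        # (32, 10)

    # 1-D inputs (B, E, T): time series -> stats over axes (B, T), one pair per E
    print(batch_norm(np.random.randn(32, 64, 100)).shape)   # (32, 64, 100)

    # 2-D inputs (B, C, H, W): images -> stats over axes (B, H, W), one pair per C
    y = batch_norm(np.random.randn(32, 3, 28, 28))
    print(np.allclose(y.mean(axis=(0, 2, 3)), 0))           # True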

Upvotes: 0

devforfu

Reputation: 1612

As far as I know, in feed-forward (dense) layers one applies batch normalization per unit (neuron), because each unit has its own weights. Therefore, you normalize across the feature axis.

But in convolutional layers, the weights are shared across spatial positions, i.e., each feature map applies the same transformation to every "volume" (patch) of the input. Therefore, you apply batch normalization using one mean and variance per feature map, NOT per unit/neuron.

That's why, I guess, there is a difference in the value of the axis parameter.
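A small sketch of that difference, using tf.nn.moments just to show which axes the statistics are computed over (the variable names are mine):

    import tensorflow as tf

    # Dense output: (batch, units). Each unit has its own weights, so each
    # unit gets its own statistics -> reduce over the batch axis only.
    dense_act = tf.random.normal((32, 64))
    mu_d, var_d = tf.nn.moments(dense_act, axes=[0])
    print(mu_d.shape)  # (64,) -- one mean per unit

    # Conv output: (batch, H, W, channels). A filter's weights are shared
    # across all spatial positions, so the whole feature map shares its
    # statistics -> reduce over the batch *and* spatial axes.
    conv_act = tf.random.normal((32, 28, 28, 16))
    mu_c, var_c = tf.nn.moments(conv_act, axes=[0, 1, 2])
    print(mu_c.shape)  # (16,) -- one mean per feature map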

Upvotes: 12

Jin Wang

Reputation: 251

In CNNs for images, normalizing within each channel is helpful because the filter weights are shared across spatial locations within a channel. The figure from the Group Normalization paper shows which axes each normalization scheme computes its statistics over; it's helpful for understanding BN better.

figure from the Group Normalization paper

Figure taken from Wu, Y. and He, K., 2018. Group normalization. arXiv preprint arXiv:1803.08494.

Upvotes: 25

Maverick Meerkat

Reputation: 6404

I was puzzled by this for a few hours, as it didn't seem to make sense to normalize over the channel axis: every channel in a conv net is considered a different "feature", so normalizing over all channels together would be equivalent to normalizing the number of bedrooms together with the size in square feet (the multivariate regression example from Andrew Ng's ML course). That is not what normalization does: you normalize every feature by itself. I.e., you normalize the number of bedrooms across all examples to mu=0 and std=1, and you normalize the square feet across all examples to mu=0 and std=1.

After checking and testing it myself, I realized what the issue is: there's a bit of confusion/misconception here. The axis you specify in Keras is actually the axis which is excluded from the calculations, i.e. you average over every axis except the one specified by this argument. This is confusing, as it is exactly the opposite of how NumPy works, where the specified axis is the one you perform the operation over (e.g. np.mean, np.std, etc.). EDIT: check this answer here.
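For example (a small sketch; the shapes are arbitrary, and the layer's moving_mean shape shows which axis was kept):

    import numpy as np
    from tensorflow.keras.layers import BatchNormalization

    x = np.random.randn(32, 28, 28, 16).astype(np.float32)

    # NumPy: the axis you pass is the one that gets reduced away.
    print(np.mean(x, axis=-1).shape)   # (32, 28, 28)

    # Keras: the axis you pass is the one that is kept; statistics are
    # computed over all the other axes, like np.mean(x, axis=(0, 1, 2)).
    bn = BatchNormalization(axis=-1)
    bn(x, training=True)
    print(bn.moving_mean.shape)        # (16,) -- one statistic per channel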

I actually built a toy model with only a BN layer, then calculated the BN manually: I took the mean and std across the first 3 dimensions [m, n_W, n_H], got n_C results, calculated (X - mu)/std (using broadcasting), and got results identical to the Keras output.
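A sketch along those lines (with center/scale disabled so the Keras layer does only the normalization; the input shape is arbitrary):

    import numpy as np
    import tensorflow as tf

    x = np.random.randn(64, 8, 8, 4).astype(np.float32)

    # Toy "model": just one BN layer, no learned beta/gamma.
    bn = tf.keras.layers.BatchNormalization(axis=-1, center=False, scale=False)
    keras_out = bn(x, training=True).numpy()

    # Manual BN: mean/var across the first 3 dimensions -> n_C values each.
    mu = x.mean(axis=(0, 1, 2))
    var = x.var(axis=(0, 1, 2))
    manual_out = (x - mu) / np.sqrt(var + 1e-3)  # 1e-3 is Keras' default epsilon

    print(np.allclose(keras_out, manual_out, atol=1e-5))  # True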

So I'm pretty sure about this.

Upvotes: 6
