Reputation: 69
I am building a neural network from scratch, and I'm now stuck at the batch normalization step. The problem is that I can't find any good values of gamma and beta to initialize batch normalization with. There are well-known tricks for initializing W and b, but I can't find any tips for initializing gamma and beta. Is there a tip or trick I can use to initialize gamma and beta and get decent accuracy?
Upvotes: 1
Views: 3910
Reputation: 22234
Start with gamma as 1 and beta as 0. Gamma works as an L1 regularizer, as shown in the original paper: if there are 3 elements in X, (x0, x1, x2), with corresponding gammas (g0, g1, g2), then setting g0 to zero means you are declaring that the x0 feature does not contribute, because g0 * x0 is zero.
You want to train the network to learn each gamma, so that it learns which features to suppress and which to amplify. A minimal sketch of this gating idea is below.
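Here is a minimal NumPy sketch of that gating behaviour (the function and variable names are just illustrative, not from the paper): gamma starts as ones and beta as zeros, and zeroing one gamma element removes that feature's contribution.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalise each feature over the batch dimension, then scale and shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

num_features = 3
gamma = np.ones(num_features)   # initialise gamma to 1: start with identity scaling
beta = np.zeros(num_features)   # initialise beta to 0: start with no shift

x = np.random.randn(8, num_features)          # a batch of 8 examples
y = batch_norm_forward(x, gamma, beta)        # initially y is just x_norm

gamma[0] = 0.0                                # suppress feature x0
y_gated = batch_norm_forward(x, gamma, beta)  # first column is now all zeros
```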
One of the benefits of Batch Normalization is that we do not have to worry so much about how to initialize the weights (He, Xavier, etc.). Hence I would recommend keeping such weight initialization separate from Batch Normalization.
Upvotes: 0
Reputation: 41
PyTorch initialises gamma (scale) to ones and beta (shift) to zeros, which you can find in both the links below:
The first link points to the exact lines in the git repository.
The second link goes straight to the PyTorch docs/source.
Updating my answer: the justification is described perfectly in this blog post detailing the back-propagation of a batch-norm layer (and it is quite intuitive). The direct quote is:
We initialize the BatchNorm Parameters to transform the input to zero mean/unit variance distributions but during training they can learn that any other distribution might be better.
My interpretation: we initialise gamma to ones and beta to zeros so that the batch-norm transformation is initially the identity on the normalised input, i.e. y = Gamma*X_norm + Beta = X_norm, which has zero mean and unit variance per feature. This gives the model a normalised starting point, from which it can learn gamma and beta to scale and shift the distribution of each feature as needed (for the current layer).
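As a quick sketch of what those defaults mean in practice (assuming a recent PyTorch; the layer size and batch shape here are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)
print(bn.weight)  # gamma: initialised to ones
print(bn.bias)    # beta:  initialised to zeros

# In training mode, a freshly initialised layer simply outputs the
# batch-normalised input (zero mean, unit variance per feature),
# since y = 1 * x_norm + 0.
x = torch.randn(16, 4) * 3.0 + 5.0
y = bn(x)
print(y.mean(dim=0))  # approximately 0 for each feature
print(y.std(dim=0))   # approximately 1 (up to eps and biased-variance details)
```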
Upvotes: 3