Reputation: 85
Consider the following code snippet:
from keras import layers, models
from keras.layers import BatchNormalization

model = models.Sequential()
model.add(layers.Dense(256, activation='relu'))  # Layer 1
model.add(BatchNormalization())
model.add(layers.Dense(128, activation='relu'))  # Layer 2
I am using Keras with the TensorFlow backend.
My question is: in Keras's implementation, is BN performed before or after the activation function?
To add more clarity:
Whether BN should be applied before or after the activation is subject to debate. The original paper (Ioffe and Szegedy, 2015) suggests "before", but the comments in this thread show diverse opinions: Ordering of batch normalization and dropout?
In the Keras documentation (https://keras.io/layers/normalization/), it says "Normalize the activations of the previous layer at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1."
The Keras documentation seems to suggest that BN is applied AFTER the activation (i.e. in the example code above, BN is applied after the 'relu' of Layer 1). Can someone confirm whether this is the case?
In addition, is it possible to configure whether BN is applied before or after the activation function?
Thanks!
Upvotes: 4
Views: 5149
Reputation: 996
In addition to the original paper using batch normalization before the activation, the Deep Learning book by Goodfellow, Bengio, and Courville (section 8.7.1) gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:
It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW+b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparameterization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.
In other words, if we use a relu activation, all negative values are mapped to zero. A large share of the outputs then sits at exactly zero, and the distribution of the remaining data is heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.
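A quick NumPy sketch of what relu does to roughly Gaussian pre-activations (purely illustrative numbers, not from the book):

import numpy as np

# Statistics BN would see when placed after a relu activation.
rng = np.random.default_rng(0)
pre = rng.standard_normal(100_000)      # roughly Gaussian pre-activations
post = np.maximum(pre, 0.0)             # relu: all negatives clipped to 0

frac_zero = np.mean(post == 0.0)                              # ~0.5: a big spike at exactly zero
skew = np.mean((post - post.mean()) ** 3) / post.std() ** 3   # clearly positive: right-skewed tail
print(f"fraction exactly zero: {frac_zero:.2f}, skewness: {skew:.2f}")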
Some report better results when placing batch normalization after activation, while others get better results with batch normalization before activation. It is probably best to test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.
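If you want to run that comparison yourself, here is a minimal sketch (the 784-dimensional input, the layer sizes, and the build_model helper are just illustrative; note the use_bias=False in the BN-before-activation variant, mirroring the bias remark in the quote above):

from keras import layers, models

def build_model(bn_before_activation):
    model = models.Sequential()
    if bn_before_activation:
        # BN on the linear output; the bias is redundant with BN's beta, so drop it
        model.add(layers.Dense(256, use_bias=False, input_shape=(784,)))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation('relu'))
    else:
        # BN on the activated output
        model.add(layers.Dense(256, activation='relu', input_shape=(784,)))
        model.add(layers.BatchNormalization())
    model.add(layers.Dense(10, activation='softmax'))
    return model

# Train both variants on the same training/validation split and compare val_loss.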
Upvotes: 3
Reputation: 2285
Whether to add BatchNorm before or after the activation is still an open debate. The original ordering suggested by the authors works well and has been used in many implementations. But many people have found that BN after the activation also works well and leads to faster convergence. For example, check the discussion in this thread.
In short, it depends on the task! Which one is going to perform better? You have to check that for yourself. And yes, you can control the order. For example (the imports and the input shape below are just for illustration):
from keras.layers import Input, Conv2D, BatchNormalization, Activation

inputs = Input(shape=(32, 32, 3))  # illustrative input shape

x = Conv2D(64, (3, 3), activation=None)(inputs)  # no activation on the conv itself
x = BatchNormalization()(x)
x = Activation("relu")(x)                        # activation applied after BN
or
x = Conv2D(64, (3, 3), activation="relu")(inputs)  # activation fused into the conv
x = BatchNormalization()(x)                        # BN applied after the activation
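The same choice can be expressed with the Sequential API from the question. A sketch, reusing the layer sizes from the question's snippet (and its from keras import layers, models):

model = models.Sequential()
model.add(layers.Dense(256))                     # Layer 1, no activation here
model.add(layers.BatchNormalization())           # BN on the linear output
model.add(layers.Activation('relu'))             # relu applied after BN
model.add(layers.Dense(128, activation='relu'))  # Layer 2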
Upvotes: 15