Tobias Hermann

Reputation: 10926

BatchNormalization layer in Keras gives unexpected output values

Given the input values [1, 5], normalizing them should yield something like [-1, 1] if I understand correctly, because

mean = 3
var = 4
result = (x - mean) / sqrt(var)
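
Checking that expectation by hand with NumPy (just the formula above, independent of Keras):

import numpy as np

x = np.array([1.0, 5.0])
mean = x.mean()  # 3.0
var = x.var()    # 4.0 (population variance)
print((x - mean) / np.sqrt(var))  # [-1.  1.]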

However, this minimal example

import numpy as np

import keras
from keras.models import Model
from keras.layers import Input
from keras.layers.normalization import BatchNormalization
from keras import backend as K

shape = (1,2,1)
input = Input(shape=shape)
x = BatchNormalization(center=False)(input) # no beta
model = Model(inputs=input, outputs=x)
model.compile(loss='mse', optimizer='sgd')

# training with dummy data
training_in = [np.random.random(size=(10, *shape))]
training_out = [np.random.random(size=(10, *shape))]
model.fit(training_in, training_out, epochs=10)

data_in = np.array([[[[1], [5]]]], dtype=np.float32)
data_out = model.predict(data_in)

print('gamma   :', K.eval(model.layers[1].gamma))
#print('beta    :', K.eval(model.layers[1].beta))
print('moving_mean:', K.eval(model.layers[1].moving_mean))
print('moving_variance:', K.eval(model.layers[1].moving_variance))

print('epsilon :', model.layers[1].epsilon)
print('data_in :', data_in)
print('data_out:', data_out)

produces the following output:

gamma   : [ 0.80644524]
moving_mean: [ 0.05885344]
moving_variance: [ 0.91000736]
epsilon : 0.001
data_in : [[[[ 1.]
   [ 5.]]]]
data_out: [[[[ 0.79519051]
   [ 4.17485714]]]]

So it is [0.79519051, 4.17485714] instead of [-1, 1].

I had a look at the source, and the values seem to be forwarded to tf.nn.batch_normalization. Based on that, the result should be what I expect, but obviously it is not.

So how are the output values calculated?

Upvotes: 2

Views: 1748

Answers (2)

Tobias Hermann

Reputation: 10926

The correct formula is this:

result = gamma * (input - moving_mean) / sqrt(moving_variance + epsilon) + beta

And here is a script for verification:

import math
import numpy as np
import tensorflow as tf
from keras import backend as K

from keras.models import Model
from keras.layers import Input
from keras.layers.normalization import BatchNormalization

np.random.seed(0)

print('=== keras model ===')
input_shape = (1,2,1)
input = Input(shape=input_shape)
x = BatchNormalization()(input)
model = Model(inputs=input, outputs=x)
model.compile(loss='mse', optimizer='sgd')
training_in = [np.random.random(size=(10, *input_shape))]
training_out = [np.random.random(size=(10, *input_shape))]
# train on dummy data so gamma, beta and the moving statistics take non-trivial values
model.fit(training_in, training_out, epochs=100, verbose=0)
data_in = [[[1.0], [5.0]]]
data_model = np.array([data_in])
result = model.predict(data_model)
gamma = K.eval(model.layers[1].gamma)
beta = K.eval(model.layers[1].beta)
moving_mean = K.eval(model.layers[1].moving_mean)
moving_variance = K.eval(model.layers[1].moving_variance)
epsilon = model.layers[1].epsilon
print('gamma:          ', gamma)
print('beta:           ', beta)
print('moving_mean:    ', moving_mean)
print('moving_variance:', moving_variance)
print('epsilon:        ', epsilon)
print('data_in:        ', data_in)
print('result:         ', result)

print('=== numpy ===')
np_data = [data_in[0][0][0], data_in[0][1][0]]
np_mean = moving_mean[0]
np_variance = moving_variance[0]
np_offset = beta[0]
np_scale = gamma[0]
np_result = [np_scale * (x - np_mean) / math.sqrt(np_variance + epsilon) + np_offset for x in np_data]
print(np_result)

print('=== tensorflow ===')
tf_data = tf.constant(data_in)
tf_mean = tf.constant(moving_mean)
tf_variance = tf.constant(moving_variance)
tf_offset = tf.constant(beta)
tf_scale = tf.constant(gamma)
tf_variance_epsilon = epsilon
tf_result = tf.nn.batch_normalization(tf_data, tf_mean, tf_variance, tf_offset, tf_scale, tf_variance_epsilon)
tf_sess = tf.Session()
print(tf_sess.run(tf_result))

print('=== keras backend ===')
k_data = K.constant(data_in)
k_mean = K.constant(moving_mean)
k_variance = K.constant(moving_variance)
k_offset = K.constant(beta)
k_scale = K.constant(gamma)
k_variance_epsilon = epsilon
k_result = K.batch_normalization(k_data, k_mean, k_variance, k_offset, k_scale, k_variance_epsilon)
print(K.eval(k_result))

Output:

gamma:           [ 0.22297101]
beta:            [ 0.49253803]
moving_mean:     [ 0.36868709]
moving_variance: [ 0.41429576]
epsilon:         0.001
data_in:         [[[1.0], [5.0]]]
result:          [[[[ 0.71096909]
   [ 2.09494853]]]]

=== numpy ===
[0.71096905498374263, 2.0949484904433255]

=== tensorflow ===
[[[ 0.71096909]
  [ 2.09494853]]]

=== keras backend ===
[[[ 0.71096909]
  [ 2.09494853]]]

Upvotes: 0

gdelab

Reputation: 6220

If you're using gamma, the right equation is actually result = gamma * (x - mean) / sqrt(var) for batch normalization, BUT mean and var are not computed the same way at training and inference time:

  • During training (fit), they are mean_batch and var_batch, calculated from the input values of the batch (they are simply the mean and variance of your batch), just as you're doing. Meanwhile, a global moving_mean and moving_variance are learnt this way: moving_mean = alpha * moving_mean + (1 - alpha) * mean_batch, where alpha is a kind of learning rate in (0, 1), usually above 0.9. moving_mean and moving_variance are approximations of the real mean and variance of all your training data (see the sketch after this list). Gamma is also learnt, by the usual gradient descent, to best fit your output.

  • During inference (predict), you just use the learnt values of moving_mean and moving_variance, not mean_batch and var_batch at all. You also use the learnt gamma.
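
A minimal sketch of that moving-average update in plain NumPy (alpha and the dummy batches are arbitrary, just for illustration):

import numpy as np

alpha = 0.99  # called momentum in Keras
moving_mean, moving_variance = 0.0, 1.0  # typical initial values

for _ in range(100):
    batch = np.random.random(size=(10,))  # one dummy batch per step
    # exponential moving averages of the per-batch statistics
    moving_mean = alpha * moving_mean + (1 - alpha) * batch.mean()
    moving_variance = alpha * moving_variance + (1 - alpha) * batch.var()

print(moving_mean, moving_variance)  # drifting towards ~0.5 and ~1/12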

So 0.05885344 is just an approximation of the mean of your random input data, 0.91000736 of its variance, and you're using these to normalize your new data [1, 5]. You can easily check that [0.79519051, 4.17485714] = gamma * ([1, 5] - moving_mean) / sqrt(moving_variance + epsilon).
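
A quick NumPy check using the values printed in the question (note the epsilon term under the square root):

import numpy as np

gamma = 0.80644524
moving_mean = 0.05885344
moving_variance = 0.91000736
epsilon = 0.001

x = np.array([1.0, 5.0])
print(gamma * (x - moving_mean) / np.sqrt(moving_variance + epsilon))
# ≈ [0.79519051, 4.17485714], matching data_out above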

Edit: alpha is called momentum in Keras, if you want to check it.
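
It is exposed directly as a layer argument, e.g.:

from keras.layers import Input
from keras.layers.normalization import BatchNormalization

inp = Input(shape=(1, 2, 1))
x = BatchNormalization(momentum=0.99)(inp)  # momentum is the alpha above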

Upvotes: 2
