DaveTheAl

Reputation: 2155

Keras BatchNorm: Training accuracy increases while Testing accuracy decreases

I am trying to use BatchNorm in Keras. The training accuracy increases over time, from 12% to 20%, slowly but surely. The test accuracy, however, decreases from 12% to 0%. The random baseline is 12%.

I strongly suspect this is due to the batchnorm layer (removing it brings the test accuracy back to ~12%), which perhaps does not initialize the parameters gamma and beta well enough. Do I have to consider anything special when applying batchnorm? I don't really understand what else could have gone wrong. I have the following model:

from keras.models import Sequential
from keras.layers import Activation, BatchNormalization, Conv2D, Dense, Reshape
from keras import optimizers, regularizers

model = Sequential()

model.add(BatchNormalization(input_shape=(16, 8)))
model.add(Reshape((16, 8, 1)))

#1. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))

#2. Conv (64 filters; 3x3 kernel)
model.add(default_Conv2D())
model.add(BatchNormalization(axis=3))
model.add(Activation('relu'))

... 

#8. Affine (NUM_GESTURES units) Output layer
model.add(default_Dense(NUM_GESTURES))
model.add(Activation('softmax'))


sgd = optimizers.SGD(lr=0.1)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

default_Conv2D and default_Dense are defined as follows:

def default_Conv2D():
    return Conv2D(
        filters=64,
        kernel_size=3,
        strides=1,
        padding='same',
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None), #RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None), # RandomUniform(),
        # bias_regularizer=None
    )

def default_Dense(units):
    return Dense(
        units=units,
        # activation=None,
        # use_bias=True,
        # kernel_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),#RandomUniform(),
        # bias_initializer=RandomNormal(mean=0.0, stddev=0.01, seed=None),#RandomUniform(),
        kernel_regularizer=regularizers.l2(0.0001),
        # bias_regularizer=None
    )

Upvotes: 2

Views: 3104

Answers (2)

DaveTheAl

Reputation: 2155

It seems that there was something broken with Keras itself.

A naive

pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

did the trick.

@wontonimo, thanks a lot for your really great answer!

Upvotes: 0

Anton Panchishin

Reputation: 3763

The issue is overfitting.

This is supported by your first 2 observations :

  1. The training accuracy increases over time, from 12% to 20% ... test accuracy however decreases from 12% to 0%
  2. removing the batchnorm layer results in ~12% test accuracy

The first statement tells me that your network is memorizing the training set. The second tells me that when you prevent the network from memorizing the training set (or even learning at all), it stops making the errors that come from memorization.

There are a few solutions to overfitting, but it is a problem larger than this post. Please treat the following list as a "top" list, not an exhaustive one:

  • add a regularizer like Dropout just before your final fully connected layer (see the sketch after this list)
  • add an L1 or L2 regularizer on the matrix weights
  • add a regularizer like Dropout between Conv layers
  • your network may have too many free parameters: try reducing it to just 1 Conv layer, then add one layer back at a time, retraining and testing each time
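
As a concrete illustration of the Dropout suggestions, here is a minimal Keras sketch; the 0.25/0.5 rates are illustrative assumptions rather than tuned values, and NUM_GESTURES is the constant from the question:

from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten

model = Sequential()
model.add(Conv2D(64, kernel_size=3, padding='same', input_shape=(16, 8, 1)))
model.add(Activation('relu'))
model.add(Dropout(0.25))  # regularize between Conv blocks
model.add(Flatten())
model.add(Dropout(0.5))   # regularize just before the final fully connected layer
model.add(Dense(NUM_GESTURES))
model.add(Activation('softmax'))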

Slow increase in accuracy

As a side note, you hinted that your accuracy isn't increasing as fast as you'd like by saying slowly but surely. I've had great success when I've done all of the following:

  • change your loss function to be the average loss over all predictions for all items in the mini-batch. This makes your loss independent of your batch size; otherwise, if you change the batch size, the loss changes with it and you have to retune your learning rate for SGD.
  • your loss is then a single number, the average over all predicted classes and all samples, so use a learning rate of 1.0. There is no need to scale it anymore.
  • use tf.train.MomentumOptimizer with learning_rate = 1.0 and momentum = 0.5. MomentumOptimizer has been shown to be much more robust than GradientDescent (a Keras equivalent is sketched below).
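
Staying within Keras rather than raw TensorFlow, a minimal sketch of the equivalent setup, assuming SGD with momentum as the stand-in for tf.train.MomentumOptimizer:

from keras import optimizers

# Keras's categorical_crossentropy is already averaged over the mini-batch,
# so the loss is independent of batch size out of the box.
sgd = optimizers.SGD(lr=1.0, momentum=0.5)  # values suggested above
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])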

Upvotes: 4
