Gegenwind

Reputation: 1428

Train neural network: Mathematical reason for NaN due to batch size

I am training a CNN. I use Google's pre-trained InceptionV3 with the last layer replaced for classification. During training, I had a lot of issues with my cross-entropy loss becoming NaN. After trying different things (reducing the learning rate, checking the data, etc.), it turned out that the training batch size was too high.

Reducing the training batch size from 100 to 60 solved the issue. Can you explain why too high a batch size causes this problem with a cross-entropy loss function? Also, is there a way to overcome this issue so I can work with higher batch sizes (there is a paper suggesting batch sizes of 200+ images for better accuracy)?

Upvotes: 3

Views: 3289

Answers (1)

Nipun Wijerathne

Reputation: 1829

Larger weights in the network (resulting from exploding gradients) produce skewed probabilities in the softmax layer, for example [0, 1, 0, 0, 0] instead of [0.1, 0.6, 0.1, 0.1, 0.1]. These skewed probabilities in turn produce numerically unstable values in the cross-entropy loss function:

# y is the softmax output (predicted probabilities); y_ is the one-hot label
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))

When the softmax output y contains an exact 0, tf.log(y) evaluates to -inf: for entries where the label y_ is 0 the product becomes 0 * (-inf), which is NaN, and for the true class the loss becomes infinite. Hence the NaN loss.
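One common way to avoid this, independent of the batch size, is to never take the log of an exact zero. A minimal TF1-style sketch (the clipping threshold and the use of tf.nn.softmax_cross_entropy_with_logits are my additions, not part of the original answer):

import tensorflow as tf

num_classes = 5
y_ = tf.placeholder(tf.float32, [None, num_classes])      # one-hot labels
logits = tf.placeholder(tf.float32, [None, num_classes])  # raw outputs of the last layer
y = tf.nn.softmax(logits)                                  # predicted probabilities

# Option 1: clip the probabilities so tf.log never sees an exact 0.
cross_entropy_clipped = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

# Option 2: compute softmax and cross-entropy together from the logits,
# which stays numerically stable even when the probabilities saturate.
cross_entropy_stable = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))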

The main reason for the weights becoming larger and larger is the exploding gradient problem. Let's consider the gradient update:

Δw_ij = −η ∂E/∂w_ij

where η is the learning rate and ∂E/∂w_ij is the partial derivative of the loss with respect to the weight. In mini-batch training, ∂E/∂w_ij is the average of the per-example gradients over a mini-batch B, so the size and noise of the update depend on both the mini-batch size |B| and the learning rate η.
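To make the averaging explicit, here is a tiny NumPy sketch (illustrative only, with randomly simulated per-example gradients standing in for the real ones):

import numpy as np

eta = 0.01                                   # learning rate
per_example_grads = np.random.randn(60, 10)  # |B| = 60 per-example gradients for 10 weights

batch_grad = per_example_grads.mean(axis=0)  # (1/|B|) * sum over the mini-batch
delta_w = -eta * batch_grad                  # the update actually applied to the weights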

In order to tackle this problem, you can reduce the learning rate. As a rule of thumb, it's better to start with a learning rate close to zero and increase it by a really small amount at a time while observing the loss.
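To see why a smaller learning rate helps, here is a toy example (not from the answer; a one-parameter loss E(w) = w**2 stands in for the real network) where a large η makes the weight, and hence the gradients, blow up:

# Gradient descent on E(w) = w**2, so dE/dw = 2*w.
for eta in (0.1, 1.5):
    w = 1.0
    for _ in range(10):
        w -= eta * 2.0 * w       # update step: Δw = -η * dE/dw
    print("eta =", eta, "-> w after 10 steps:", w)
# eta = 0.1 shrinks w towards 0; eta = 1.5 multiplies w by -2 every step, so it explodes.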

Moreover, reducing the mini-batch size increases the variance of the stochastic gradient updates. This sometimes helps to mitigate NaN by adding noise to the gradient update direction.
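The effect of the batch size on that variance is easy to check with simulated per-example gradients (again just an illustration, not code from the answer):

import numpy as np

rng = np.random.RandomState(0)
per_example_grads = rng.randn(12000)             # simulated per-example gradients of one weight

for batch_size in (20, 60, 100):
    estimates = per_example_grads.reshape(-1, batch_size).mean(axis=1)
    print(batch_size, estimates.std())           # the estimate's spread shrinks as |B| grows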

Upvotes: 5
