Reputation: 2071
I tried to follow along with Martin Gorner's lecture on using TensorFlow, and also the tutorial in the official TensorFlow documentation.
I'm confused about why, in Gorner's lecture, he uses the negative sum of the dot product between the labels and the predictions as the loss, while the TensorFlow tutorial uses the same expression but then divides it by the batch size to get the mean over each mini-batch.
Both will work as long as you scale the learning rate accordingly, but I don't understand the reason for the difference between the two methods.
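For concreteness, here is a minimal sketch of the two variants as I understand them (the tensor names and example values are mine, and I'm assuming the usual cross-entropy form with the log of the predictions):

    import tensorflow as tf

    # Hypothetical mini-batch: one-hot labels and softmax outputs
    labels = tf.constant([[1.0, 0.0], [0.0, 1.0]])
    preds = tf.constant([[0.8, 0.2], [0.3, 0.7]])

    # Gorner's lecture: negative sum over the whole mini-batch
    loss_sum = -tf.reduce_sum(labels * tf.math.log(preds))

    # TensorFlow tutorial: the same quantity divided by the batch size
    batch_size = tf.cast(tf.shape(labels)[0], tf.float32)
    loss_mean = loss_sum / batch_size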
Upvotes: 0
Views: 245
Reputation: 1502
Using the mean instead of the sum makes the magnitude of the objective function invariant to the mini-batch size: with the sum, the loss and its gradient grow linearly with the batch size, so changing the batch size effectively changes the step size. With the mean, when you decide to change the mini-batch size, you can expect the same learning rate as before to still work well.
The same holds for other hyper-parameters, e.g., the L2 regularization factor.
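As a rough illustration (a toy example of mine, not tied to either tutorial), you can compare the gradient norms of a sum-based and a mean-based squared-error loss at different batch sizes:

    import tensorflow as tf

    def grad_norm(reduce_fn, batch_size):
        # Toy linear model with a fixed weight vector
        w = tf.Variable([[1.0], [2.0]])
        x = tf.ones((batch_size, 2))
        y = tf.zeros((batch_size, 1))
        with tf.GradientTape() as tape:
            loss = reduce_fn((y - x @ w) ** 2)
        return tf.norm(tape.gradient(loss, w))

    for bs in (32, 64):
        # The sum-based gradient norm doubles with the batch size;
        # the mean-based gradient norm stays the same.
        print(bs, grad_norm(tf.reduce_sum, bs).numpy(),
                  grad_norm(tf.reduce_mean, bs).numpy())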
Upvotes: 2
Reputation: 482
It seems that the mean keeps the loss on a consistent scale even when it aggregates many terms whose total would otherwise be very large. When you use the sum, there is no guarantee that the magnitudes stay on a comparable scale, but with the mean you can be sure the value does not blow up.
Upvotes: 0