Reputation: 2071
I tried to follow along with Martin Gorner's lecture on using TensorFlow, and also the tutorial in the official TensorFlow documentation.
I'm confused about why, in Gorner's lecture, he uses the negative sum of the dot product between the labels and the predictions as the loss, while the TensorFlow tutorial uses the same expression but then divides it by the batch size to get the mean over each mini-batch.
Both will work as long as you scale the learning rate accordingly, but I don't understand the reason for the difference between the two methods.
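For concreteness, here is a minimal sketch of the two variants as I understand them (the tensor names and example values are mine, and I'm assuming the usual cross-entropy form with the log of the predictions):

    import tensorflow as tf

    # Hypothetical mini-batch: one-hot labels and softmax outputs
    labels = tf.constant([[1.0, 0.0], [0.0, 1.0]])
    preds = tf.constant([[0.8, 0.2], [0.3, 0.7]])

    # Gorner's lecture: negative sum over the whole mini-batch
    loss_sum = -tf.reduce_sum(labels * tf.math.log(preds))

    # TensorFlow tutorial: the same quantity divided by the batch size
    batch_size = tf.cast(tf.shape(labels)[0], tf.float32)
    loss_mean = loss_sum / batch_size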
Upvotes: 0
Views: 245
Reputation: 1502
Using the mean instead of the sum makes the magnitude of the objective function invariant to the mini-batch size: with the sum, the loss and its gradient grow linearly with the batch size, so changing the batch size effectively changes the step size. With the mean, when you decide to change the mini-batch size, you can expect the same learning rate as before to still work well.
The same holds for other hyper-parameters, e.g., the L2 regularization factor.
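As a rough illustration (a toy example of mine, not tied to either tutorial), you can compare the gradient norms of a sum-based and a mean-based squared-error loss at different batch sizes:

    import tensorflow as tf

    def grad_norm(reduce_fn, batch_size):
        # Toy linear model with a fixed weight vector
        w = tf.Variable([[1.0], [2.0]])
        x = tf.ones((batch_size, 2))
        y = tf.zeros((batch_size, 1))
        with tf.GradientTape() as tape:
            loss = reduce_fn((y - x @ w) ** 2)
        return tf.norm(tape.gradient(loss, w))

    for bs in (32, 64):
        # The sum-based gradient norm doubles with the batch size;
        # the mean-based gradient norm stays the same.
        print(bs, grad_norm(tf.reduce_sum, bs).numpy(),
                  grad_norm(tf.reduce_mean, bs).numpy())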
Upvotes: 2
Reputation: 482
It seems that the mean keeps the loss on a consistent scale even when it aggregates many terms whose total would otherwise be very large. When you use the sum, there is no guarantee that the magnitudes stay on a comparable scale, but with the mean you can be sure the value does not blow up.
Upvotes: 0