Reputation: 33395
Some TensorFlow examples calculate the cost function like this:
cost = tf.reduce_sum((pred-y)**2 / (2*n_samples))
So the divisor is twice the number of samples.
Is the reason for the extra factor of 2 that, when the cost function is differentiated for backpropagation, the 1/2 cancels the factor of 2 brought down by the power rule and saves an operation? (The cancellation is written out below.)
If so, is it still recommended to do this, and does it actually provide a significant performance improvement?
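For concreteness, this is the cancellation I have in mind, written out for a simple one-weight linear model (my own notation, not taken from the TensorFlow examples):

% The 1/2 in the cost cancels the 2 brought down by the power rule,
% leaving a gradient with no leftover constant factor.
J(w) = \frac{1}{2n} \sum_{i=1}^{n} (w x_i - y_i)^2
\qquad\Longrightarrow\qquad
\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (w x_i - y_i)\, x_i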
Upvotes: 1
Views: 50
Reputation: 53758
It's convenient in the math, because one doesn't need to carry the factor of 1/2 through the derivation. In code, though, it doesn't make a big difference: the change only makes the gradients (and, correspondingly, the updates of the trainable variables) twice as big or as small. Since the updates are multiplied by the learning rate, this factor of 2 can be undone by a minor change of that hyperparameter. I say minor because it's common to try learning rates on a log scale during model selection anyway: 0.1, 0.01, 0.001, ...
As a result, no matter which constant scaling is used in the loss function, its effect is negligible and doesn't lead to any training speed-up. Choosing the right learning rate is more important.
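As a quick numerical check of that equivalence, here is a minimal sketch (assuming TensorFlow 2 eager execution; the data, the single weight w, and the learning rate are made-up illustration values, not from the question):

import tensorflow as tf

# Toy data and a single trainable weight, purely for comparing gradients.
x = tf.constant([1.0, 2.0, 3.0, 4.0])
y = tf.constant([2.0, 4.0, 6.0, 8.0])
n_samples = 4
w = tf.Variable(0.5)

def grad_of(loss_fn):
    with tf.GradientTape() as tape:
        loss = loss_fn(w * x, y)
    return tape.gradient(loss, w)

# Plain mean squared error vs. the "half" version from the question.
mse      = lambda pred, t: tf.reduce_sum((pred - t) ** 2) / n_samples
half_mse = lambda pred, t: tf.reduce_sum((pred - t) ** 2) / (2 * n_samples)

g_mse, g_half = grad_of(mse), grad_of(half_mse)
print(float(g_mse), float(g_half))   # -22.5 and -11.25: exactly a factor of 2

# One SGD step with (half_mse, lr) equals one step with (mse, lr / 2).
lr = 0.01
print(float(w - lr * g_half), float(w - (lr / 2) * g_mse))   # both 0.6125

The last two printed values are identical, which is the point: whatever constant you fold into the loss can be folded back out of the learning rate.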
Upvotes: 2