André Marques

Reputation: 35

Are individual gradients in a batch summed or averaged in a Neural Network?

I am building a neural network from scratch. I currently have a batch of 32 training examples, and for each individual example I calculate the derivatives (the gradient) and sum them.

After I sum the 32 training examples' gradients, I apply: weight += d_weight * -learning_rate;

The question is: should I sum the 32 gradients (as I do now) or average them?

Or, as an alternative:

Should I calculate all 32 gradients, one per loss output (as I do now), or average the cross-entropy loss outputs and then calculate a single gradient?
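In NumPy-style code, the two options look roughly like this. This is only a minimal sketch with made-up data; compute_gradient and the toy squared-error gradient are placeholders, not my actual cross-entropy code:

    import numpy as np

    # Toy setup -- these names and numbers are placeholders, not the real model.
    rng = np.random.default_rng(0)
    weight = rng.normal(size=10)                                  # pretend parameter vector
    batch = [(rng.normal(size=10), rng.normal()) for _ in range(32)]
    learning_rate = 0.01

    def compute_gradient(w, x, y):
        # Placeholder per-example gradient (here: gradient of a squared error).
        return 2.0 * (w @ x - y) * x

    # Option 1 (what I do now): sum the 32 per-example gradients.
    d_weight = np.zeros_like(weight)
    for x, y in batch:
        d_weight += compute_gradient(weight, x, y)
    weight_after_sum_update = weight - learning_rate * d_weight

    # Option 2: average instead. Because differentiation is linear, averaging the
    # per-example gradients gives the same result as averaging the losses first
    # and then taking a single gradient.
    weight_after_mean_update = weight - learning_rate * (d_weight / len(batch))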

I have looked at multiple sources and it is not clear what the answer is. Also, the optimal learning rate in my implementation is lower than 0.0001 for MNIST training, which is different from the 0.01 to 0.05 I have seen in other neural networks.

Upvotes: 2

Views: 2424

Answers (1)

J. Lee

Reputation: 533

Well, it depends on what you want to achieve. The loss function acts as a guide to train the neural network to become better at a task.

If we sum the cross-entropy loss outputs, the total loss (and hence the gradient) grows linearly with the mini-batch size during training.

If we take the average instead, the loss is independent of the batch size.
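As a rough numeric illustration (the per-example losses below are made up purely to show the scaling):

    import numpy as np

    per_example_loss = np.full(32, 0.5)     # pretend every example has loss 0.5
    print(per_example_loss.sum())           # 16.0 -> grows with batch size
    print(per_example_loss.mean())          # 0.5  -> independent of batch size

    per_example_loss = np.full(64, 0.5)     # double the batch
    print(per_example_loss.sum())           # 32.0 -> doubled
    print(per_example_loss.mean())          # 0.5  -> unchanged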

For your use case, I recommend taking the average, as that ensures that your loss function is decoupled from hyperparameters such as the aforementioned batch size.
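Concretely, with the accumulation loop from your question, taking the average just means dividing the summed gradient by the batch size before the weight update (a two-line sketch, assuming d_weight already holds the sum of the 32 per-example gradients):

    d_weight /= 32                          # turn the summed gradient into a mean
    weight += -learning_rate * d_weight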

Another way to see it: averaging normalizes the loss output, which also helps stabilize training, since the network becomes less sensitive to the learning rate. If we use the sum, the gradients can become very large (or even explode), which forces us to use a much lower learning rate and makes the network more sensitive to hyperparameter values.
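A quick sanity check on the learning-rate point (assuming plain SGD and nothing else changes): an update on the summed gradient with learning rate lr is identical to an update on the mean gradient with learning rate lr * batch_size, so switching conventions only rescales the effective learning rate. That is consistent with you needing a much smaller learning rate than the 0.01 to 0.05 you have seen in setups that average. The gradient values below are made up just to demonstrate the equivalence:

    import numpy as np

    grad_mean = np.array([0.2, -0.1, 0.05])     # made-up mean gradient
    batch_size = 32
    grad_sum = grad_mean * batch_size           # summed gradient is 32x larger

    lr = 0.01
    step_mean = -lr * grad_mean                 # update using the mean
    step_sum = -(lr / batch_size) * grad_sum    # same update using the sum

    print(np.allclose(step_mean, step_sum))     # True: only the effective lr changes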

Upvotes: 2
