mesllo

Reputation: 583

When training a model over multiple GPUs on the same machine using PyTorch, how is the batch size divided?

Even after looking through the PyTorch forums, I'm still not certain about this one. Let's say I'm using PyTorch DDP to train a model over 4 GPUs on the same machine.

Suppose I choose a batch size of 8. Is the model effectively backpropagating over 2 examples per GPU at every step, so that the final result is that of a model trained with a batch size of 2? Or does it gather the gradients from each GPU at every step and backpropagate with an effective batch size of 8?

Upvotes: 0

Views: 320

Answers (1)

eval

Reputation: 1239

The actual batch size is the size of the input you feed to each worker, which in your case is 8. In other words, each worker runs backpropagation over 8 examples at every step.

For a concrete code example, see https://gist.github.com/sgraaf/5b0caa3a320f28c27c12b5efeb35aa4c#file-ddp_example-py-L63. The batch size set there is this per-worker batch size.
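To make this concrete, here is a minimal sketch of my own (not the linked gist) for the 4-GPU case in the question, assuming the NCCL backend and a dummy dataset and model. The `batch_size` passed to the `DataLoader` is per process, so each of the 4 GPUs backpropagates over 8 examples per step, and DDP averages the gradients across processes during `backward()`:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def train(rank: int, world_size: int):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Dummy dataset and model, just to illustrate the data flow.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    # batch_size here is PER GPU: each rank sees 8 examples per step,
    # so the effective global batch is 8 * world_size.
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(rank), y.cuda(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced (averaged) across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = 4  # 4 GPUs on one machine, as in the question
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```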

Upvotes: 0
