Reputation: 69
In this TensorFlow tutorial, you can use N GPUs, give each GPU a mini-batch of M training samples, and compute the gradients on all GPUs concurrently.
Then you average the gradients collected from the N GPUs and update the model parameters.
But this has the same effect as using a single GPU to compute the gradient over N*M training samples and then updating the parameters.
So the only advantage, as far as I can see, is that you can process a larger mini-batch in the same amount of time.
But is a larger mini-batch necessarily better?
I thought you should avoid large mini-batches, precisely to keep the optimization more robust to saddle points.
If a larger mini-batch is indeed not better, why would you care about multi-GPU learning, or even multi-server learning?
(The tutorial above uses synchronous training. If it were asynchronous training, I could see the merit, since the parameters would be updated without averaging the gradients computed by each GPU.)
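To make the equivalence I have in mind concrete, here is a small sketch (plain NumPy, with a toy linear least-squares model I made up for illustration) showing that averaging the gradients of N mini-batches of M samples each gives exactly the gradient of one N*M-sample batch:

```python
import numpy as np

# Toy model: loss(w) = mean((X @ w - y)**2), gradient = 2/B * X.T @ (X @ w - y)
def grad(w, X, y):
    B = X.shape[0]
    return 2.0 / B * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
N, M, D = 4, 8, 3                       # N "GPUs", M samples per mini-batch, D features
X = rng.normal(size=(N * M, D))
y = rng.normal(size=N * M)
w = rng.normal(size=D)

# "Multi-GPU": each worker computes the gradient of its own M-sample mini-batch.
per_gpu = [grad(w, X[i * M:(i + 1) * M], y[i * M:(i + 1) * M]) for i in range(N)]
averaged = np.mean(per_gpu, axis=0)

# "Single GPU": one gradient over the whole N*M batch.
full = grad(w, X, y)

print(np.allclose(averaged, full))      # True: the two updates are identical
```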
Upvotes: 4
Views: 2949
Reputation: 1742
More GPUs mean more data in the global batch, and the gradients over that batch are averaged for back-propagation.
So if the learning rate per batch is kept fixed, the effective learning rate per individual sample becomes smaller.
If the learning rate per sample is to stay fixed, the learning rate per batch has to be made larger.
https://github.com/guotong1988/BERT-GPU
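To illustrate that trade-off numerically (a minimal sketch of the common linear-scaling heuristic; the numbers below are made up and not taken from the linked repo):

```python
# With averaged gradients, each sample's contribution to one update is lr / batch_size.
base_lr, base_batch = 1e-4, 32

n_gpus = 4
global_batch = base_batch * n_gpus              # 128

# Option 1: keep the per-batch learning rate fixed -> the per-sample step shrinks.
per_sample_before = base_lr / base_batch
per_sample_after = base_lr / global_batch       # 4x smaller

# Option 2: keep the per-sample step fixed -> scale the per-batch learning rate.
scaled_lr = base_lr * (global_batch / base_batch)   # linear scaling: 4x larger
assert scaled_lr / global_batch == per_sample_before
```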
Upvotes: 0
Reputation: 9781
The main purpose of multi-GPU learning is to let you train on a large data set in a shorter time. A larger mini-batch is not necessarily better, but at least you can finish learning in a more feasible amount of time.
More precisely, those N mini-batches are not processed in a synchronized way if you use an asynchronous SGD algorithm. Since the algorithm itself changes with multiple GPUs, it is not equivalent to running SGD with an N*M-sized mini-batch on a single GPU.
If you use synchronous multi-GPU training, the benefit is mainly time reduction. You could use M/N-sized mini-batches per GPU to keep the effective mini-batch size unchanged, but scalability is limited, since a smaller per-GPU mini-batch means relatively more overhead. Data exchange and synchronization across a large number of computing nodes are also a serious problem.
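As a reference point, here is a minimal sketch of synchronous data-parallel training in current TensorFlow; the model and data are placeholders I made up, but tf.distribute.MirroredStrategy is the standard API that splits the global batch across replicas and averages (all-reduces) the gradients before each update:

```python
import tensorflow as tf

# Synchronous data parallelism: each GPU gets a slice of the global batch,
# and the per-replica gradients are averaged before the single weight update.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

GLOBAL_BATCH = 64 * strategy.num_replicas_in_sync   # keep the per-GPU batch at 64

with strategy.scope():
    # Placeholder model for illustration only.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data standing in for a real dataset.
x = tf.random.normal((GLOBAL_BATCH * 10, 784))
y = tf.random.uniform((GLOBAL_BATCH * 10,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH)

model.fit(dataset, epochs=1)
```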
Finally, to address the scalability issue, people move to asynchronous SGD when using a large number of GPUs concurrently. So you probably won't see anyone using synchronous multi-GPU training on hundreds of (or even tens of) GPUs.
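For contrast, a toy sketch of the asynchronous flavour (a Hogwild-style illustration in plain NumPy and threads, not how a real parameter-server setup is implemented): each worker computes a gradient on its own mini-batch and applies it to the shared weights immediately, without waiting for or averaging with the other workers:

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
D = 5
w_true = rng.normal(size=D)
w = np.zeros(D)                          # shared parameters, updated without locking

def worker(seed, steps=200, batch=32, lr=0.05):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        X = local_rng.normal(size=(batch, D))
        y = X @ w_true + 0.01 * local_rng.normal(size=batch)
        g = 2.0 / batch * X.T @ (X @ w - y)   # gradient of the mean squared error
        w[:] -= lr * g                   # apply immediately; no cross-worker averaging

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance to true weights:", np.linalg.norm(w - w_true))
```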
Upvotes: 2