Reputation: 67
When training with MXNet, suppose the batch size is large (say 128), the number of GPUs is small (say 2), and each GPU can only handle a few samples per iteration (say 16). By default, the maximum batch size for this configuration is 16 * 2 = 32.
In theory, we could run 4 iterations before updating the weights to make the effective batch size 128. Is this possible with MXNet?
Upvotes: 0
Views: 493
Reputation: 825
Editing this answer with a more streamlined approach (memory-wise): you have to configure each parameter to accumulate gradients, run your 4 forward and backward passes, take the Trainer.step(), and then manually zero your gradients (see the sketch at the end of this answer).
As per https://discuss.mxnet.io/t/aggregate-gradients-manually-over-n-batches/504/2
"This is very straightforward to do with Gluon. You need to set the grad_req in your network Parameter instances to 'add' and manually set the gradient to zero using zero_grad() after each Trainer.step() (see here). To set grad_req to 'add':
for p in net.collect_params().values():
    p.grad_req = 'add'
"And similarly call zero_grad() on each parameter after calling Trainer.step(). Remember to modify batch_size argument of trainer.step() accordingly."
Vishaal
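Putting those pieces together, here is a minimal sketch of the full pattern, assuming a toy Dense layer, random data, a micro-batch of 32, and 4 accumulation steps purely for illustration (none of those specifics come from the posts above):

import mxnet as mx
from mxnet import autograd, gluon, nd

# Toy network and loss, just to demonstrate the accumulation pattern.
net = gluon.nn.Dense(10)
net.initialize()
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

# 1. Make every parameter accumulate (add) gradients instead of
#    overwriting them on each backward pass.
for p in net.collect_params().values():
    p.grad_req = 'add'

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

micro_batch = 32   # what fits on the GPUs per forward/backward pass
accum_steps = 4    # 4 * 32 = effective batch size of 128

# 2. Run several forward/backward passes; gradients keep adding up.
for step in range(accum_steps):
    data = nd.random.normal(shape=(micro_batch, 20))
    label = nd.random.randint(0, 10, shape=(micro_batch,)).astype('float32')
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()

# 3. Update once, normalizing by the effective batch size, then reset
#    the accumulated gradients before the next effective batch.
trainer.step(micro_batch * accum_steps)
for p in net.collect_params().values():
    p.zero_grad()

The same loop works with real DataLoader batches; only the grad_req setup, the delayed trainer.step(), and the explicit zero_grad() differ from a standard training loop.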
Upvotes: 1