Reputation: 188
I am using PyTorch's multiprocessing framework to distribute my training across multiple GPUs. I split over the batch dimension, so each GPU has its own independent batch that it computes gradients over. I then average the gradients across the GPUs using PyTorch's all_reduce function. However, the backward passes slow down significantly compared to single-GPU training.
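For context, the setup looks roughly like this (simplified; run_worker is a stand-in for my actual worker function, the init_method address is just an example, and the model/data code is omitted):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # One process per GPU, NCCL backend for GPU collectives
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    # ... build the model, optimizer, and per-rank data loader here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)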
Since I compute the gradients manually, I use the following function to average the tensors across the GPUs:
import torch.distributed as dist

def average_across_processes(vals):
    # Average each tensor in-place across all ranks
    for val in vals:
        dist.all_reduce(val, op=dist.ReduceOp.AVG)
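For reference, each training step calls it right after the backward pass, roughly like this (simplified; model, optimizer, loss_fn, and loader stand in for my actual objects):

# Simplified training step; each rank runs this on its own batch
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Average the gradients of all parameters across ranks
    average_across_processes([p.grad for p in model.parameters() if p.grad is not None])
    optimizer.step()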
The all_reduce updates the tensors in place, so I don't need to return anything from the function. Is there something I'm doing inefficiently that causes this massive slowdown?
Upvotes: 0
Views: 113