Reputation: 1
I have implemented a basic training loop for PyTorch's implementation of Mask R-CNN. I have 4 GPUs available for training, and I am using torch.nn.DataParallel() so that I can train on multiple GPUs when I want to.
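For reference, here is roughly how I set things up (a simplified sketch with dummy data; my actual dataset, optimizer, and training loop are omitted):

import torch
import torchvision

# Mask R-CNN from torchvision (pretrained=True is just for illustration;
# newer torchvision versions use the weights= argument instead)
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Wrap the model for multi-GPU training; the GPU ids go into device_ids
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()
model.train()

# Dummy inputs in the format torchvision detection models expect:
# a list of 3-channel image tensors and a list of target dicts
images = [torch.rand(3, 600, 800) for _ in range(8)]
targets = [
    {
        "boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]]),
        "labels": torch.tensor([1], dtype=torch.int64),
        "masks": torch.zeros(1, 600, 800, dtype=torch.uint8),
    }
    for _ in range(8)
]

# In training mode this forward pass should return the loss dict;
# with an even number of device_ids, this is the call that raises
# the error shown below
loss_dict = model(images, targets)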
However, when I pass an even number of GPUs (e.g. 0,1 or 0,1,2,3), I get the following error:
RuntimeError: Caught RuntimeError in replica 0 on device 6.
Original Traceback (most recent call last):
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py", line 83, in forward
    images, targets = self.transform(images, targets)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 129, in forward
    image = self.normalize(image)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 157, in normalize
    return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0
But when I use an odd number of GPUs, training runs perfectly and I get correct results too. Can anyone help me solve this?
I have tried everything I can think of, and at this point I suspect there may be something wrong with the PyTorch code itself.
Upvotes: 0
Views: 15