Mayank Singh

Reputation: 1

Dimension error when using multiple GPUs for PyTorch MaskRCNN training

I have implemented a basic training loop for PyTorch's implementation of MaskRCNN. I have 4 GPUs available for training and am using torch.nn.DataParallel() so I can use multiple GPUs when I want to. However, when I pass an even number of GPUs, e.g. device IDs 0,1 or 0,1,2,3, I get the following error:

RuntimeError: Caught RuntimeError in replica 0 on device 6.
Original Traceback (most recent call last):
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py", line 83, in forward
    images, targets = self.transform(images, targets)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 129, in forward
    image = self.normalize(image)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 157, in normalize
    return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0

But when I use an odd number of GPUs, training runs perfectly and I get correct results too. Can anyone help me solve this?

I have tried everything, but I think there is something wrong with the PyTorch code itself.
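
For reference, here is a minimal sketch of the kind of setup described above. The model construction (maskrcnn_resnet50_fpn, num_classes), the dummy images and targets, and the optimizer settings are placeholders rather than the exact code; the relevant part is the torch.nn.DataParallel() wrapper and the training step:

import torch
import torch.nn as nn
import torchvision

# Placeholder setup: the real model construction, dataset, and hyperparameters
# are assumptions; only the DataParallel wrapping mirrors the description above.
device = torch.device("cuda:0")
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)
model = nn.DataParallel(model, device_ids=[0, 1])  # even number of GPUs -> error
model.to(device)
model.train()

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# One dummy batch standing in for the real data loader.
images = [torch.rand(3, 512, 512, device=device) for _ in range(4)]
targets = [
    {
        "boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]], device=device),
        "labels": torch.tensor([1], dtype=torch.int64, device=device),
        "masks": torch.zeros(1, 512, 512, dtype=torch.uint8, device=device),
    }
    for _ in range(4)
]

# In training mode the model returns a dict of losses; with DataParallel the
# gathered values can have one element per replica, so reduce before backward.
loss_dict = model(images, targets)
loss = sum(v.sum() for v in loss_dict.values())

optimizer.zero_grad()
loss.backward()
optimizer.step()

The crash happens at the model(images, targets) call, inside the detection transform's normalize(), as shown in the traceback above.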

Upvotes: 0

Views: 15

Answers (0)
