Guillem

Reputation: 2647

Faster-RCNN PyTorch problem at prediction time with image dimensions

I am finetuning Faster-RCNN using PyTorch according to this tutorial: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

The results are pretty good, but making predictions only works when feeding a single tensor to the model. For example:

# This works well
>>> img, _ = dataset_test[3]
>>> img.shape
torch.Size([3, 1200, 1600])
>>> model.eval()
>>> with torch.no_grad():
...     preds = model([img.to(device)])

But when I feed multiple tensors at once I get that error:

>>> random_idx = torch.randint(high=50, size=(4,))
>>> images = torch.stack([dataset_test[idx][0] for idx in random_idx])
>>> images.shape
torch.Size([4, 3, 1200, 1600])
>>> with torch.no_grad():
...     preds = model(images.to(device))
RuntimeError                              Traceback (most recent call last)
<ipython-input-101-52caf8fee7a4> in <module>()
      5 model.eval()
      6 with torch.no_grad():
----> 7   prediction =  model(images.to(device))

...

RuntimeError: The expanded size of the tensor (1600) must match the existing size (1066) at non-singleton dimension 2.  Target sizes: [3, 1200, 1600].  Tensor sizes: [3, 800, 1066]

Edit

It works when feeding a list of 3D tensors (IMO this behaviour is a bit odd; I cannot understand why it does not work with a 4D tensor):

>>> random_idx = torch.randint(high=50, size=(4,))
>>> images = [dataset_test[idx][0].to(device) for idx in random_idx]
>>> len(images)  # images is a plain Python list here, so it has no .shape
4
>>> with torch.no_grad():
...     preds = model(images)
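A stacked 4D batch can also be unbound back into the list the model wants (a sketch, reusing a stacked batch like the one from the failing snippet above; torch.unbind splits along dim 0 without copying):

>>> batch = torch.stack([dataset_test[idx][0] for idx in random_idx])
>>> with torch.no_grad():
...     preds = model([img.to(device) for img in batch.unbind(0)])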

Upvotes: 0

Views: 1277

Answers (1)

asymptote

Reputation: 1402

Faster R-CNN (like Mask R-CNN and the other torchvision detection models) expects a list of tensors as input images and a list of dictionaries as targets during training mode. This particular design choice is due to the fact that each image can have a variable number of objects, i.e. the target tensor of each image will have variable dimensions, hence we are forced to use a list instead of a batch tensor of targets.
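A minimal sketch of that training interface, assuming torchvision's fasterrcnn_resnet50_fpn from the tutorial (the boxes and labels below are made-up placeholders; newer torchvision versions take weights= instead of pretrained=):

import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.train()

# a list of 3D [C, H, W] image tensors -- sizes may differ per image
images = [torch.rand(3, 1200, 1600), torch.rand(3, 800, 600)]

# one target dict per image; the number of boxes varies per image,
# which is why the targets cannot be packed into a single batch tensor
targets = [
    {"boxes": torch.tensor([[100.0, 100.0, 400.0, 400.0]]),
     "labels": torch.tensor([1])},
    {"boxes": torch.tensor([[50.0, 50.0, 200.0, 200.0],
                            [300.0, 100.0, 500.0, 400.0]]),
     "labels": torch.tensor([1, 2])},
]

loss_dict = model(images, targets)  # training mode returns a dict of losses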

However, it is not strictly necessary to use a list of image tensors instead of a batch tensor. My guess is that they went with a list of tensors for images as well for the sake of consistency. It also gives the added advantage of being able to use images of variable sizes as input rather than a fixed size.

Due to this particular design choice, the model expects a list of tensors as input during evaluation mode as well.
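In evaluation mode the call looks the same, just without the targets (again a sketch, continuing the model above; the output is a list with one prediction dict per image):

model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 1200, 1600), torch.rand(3, 600, 800)])
# preds[i] is a dict with 'boxes', 'labels' and 'scores' for image i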

As far as the speed of the model is concerned, this design choice might have some negative impact during evaluation, but I cannot say so with a hundred percent conviction. However, during training, since we have target tensors of variable dimensions for each image, we are forced to iterate over all images one by one for the loss calculation anyway. So there would be no speed gain from using a batch tensor of images over a list of image tensors during training.

Upvotes: 1
