justin_sakong

Reputation: 289

Input image size of Faster-RCNN model in Pytorch

I'm trying to implement a Faster R-CNN model with PyTorch. In the model's structure, the first element is a transform.

from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True)

print(model.transform)
GeneralizedRCNNTransform(
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    Resize(min_size=(800,), max_size=1333, mode='bilinear')
)

When images pass through the forward of Resize(), they come out as (800, h) or (w, 1333), depending on the ratio of width and height.

for i in range(2):
    _, image, target = testset.__getitem__(i)
    img = image.unsqueeze(0)
    output, _ = model.transform(img)  # output is an ImageList; output.image_sizes holds the resized sizes

Before Transform : torch.Size([512, 640])
After Transform : [(800, 1000)]
Before Transform : torch.Size([315, 640])
After Transform : [(656, 1333)]

My question is: how are those resized outputs computed, and why is this method used? I can't find the information in the paper, and I don't understand the source code of the transform in fasterrcnn_resnet50_fpn.

Sorry for my English

Upvotes: 3

Views: 4446

Answers (1)

Omkar Shidore

Reputation: 31

GeneralizedRCNN data transform: https://github.com/pytorch/vision/blob/922db3086e654871c35cd80c2c01eabb65d78475/torchvision/models/detection/generalized_rcnn.py#L15

It performs the data transformation on the inputs before they are fed into the model:

min_size: minimum size (the shorter side) of the image after rescaling, before it is fed to the backbone.

max_size: maximum size (the longer side) of the image after rescaling, before it is fed to the backbone.

The defaults (min_size=800, max_size=1333) are set in faster_rcnn.py: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py#L256
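In short, the transform picks one scale factor per image: it scales the shorter side up to min_size, unless that would push the longer side past max_size, in which case the longer side is clamped to max_size instead. Here is a minimal sketch of that rule (expected_resized_shape is my own helper name, not part of torchvision, and the real code may differ by a pixel due to interpolation rounding); it reproduces the shapes from the question:

def expected_resized_shape(h, w, min_size=800, max_size=1333):
    # Scale so the shorter side reaches min_size, but never let the longer side exceed max_size.
    scale = min(min_size / min(h, w), max_size / max(h, w))
    return int(h * scale), int(w * scale)

print(expected_resized_shape(512, 640))  # (800, 1000) -> shorter side hits 800
print(expected_resized_shape(315, 640))  # (656, 1333) -> longer side capped at 1333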

I couldn't find out why the defaults are min 800 and max 1333 either; I didn't find anything about it in the research paper.

But since the first layer is a conv layer, the network does not require a fixed input size. I apply many other augmentations, such as mirroring and random cropping, inspired by SSD-based networks, so I prefer to do all augmentation in one place once rather than twice. I would also assume the model works best during validation on images whose shapes and other properties are as close as possible to the training data.

Though you can experiment with a custom min_size and max_size:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

min_size = 900    # changed from the default 800
max_size = 1433   # changed from the default 1333
image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]

model = fasterrcnn_resnet50_fpn(pretrained=True, min_size=min_size, max_size=max_size,
                                image_mean=image_mean, image_std=image_std)

# batch of 4 images, 11 random boxes per image
images, boxes = torch.rand(4, 3, 600, 1200), torch.rand(4, 11, 4)
boxes[:, :, 2:] += boxes[:, :, :2]  # make sure x2 > x1 and y2 > y1
labels = torch.randint(1, 91, (4, 11))
images = list(image for image in images)
targets = []
for i in range(len(images)):
    d = {}
    d['boxes'] = boxes[i]
    d['labels'] = labels[i]
    targets.append(d)

output = model(images, targets)
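If you want to confirm that the custom values were picked up, here is a quick check (assuming the model built above; the expected values are what I would expect from GeneralizedRCNNTransform, which stores min_size as a tuple):

print(model.transform.min_size, model.transform.max_size)  # expected: (900,) 1433

# A 512x640 image should now be rescaled so its shorter side becomes 900:
image_list, _ = model.transform([torch.rand(3, 512, 640)])
print(image_list.image_sizes)  # expected: [(900, 1125)]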

Or you can write your own transforms entirely: https://pytorch.org/vision/stable/transforms.html

from torchvision.transforms import transforms as T
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True)
model.transform = T.Compose([...])  # fill in transforms from torchvision.transforms
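One caveat with replacing model.transform directly: the model's forward pass expects its transform to return an ImageList (and later calls transform.postprocess), so a plain T.Compose will not slot in unchanged. A simpler pattern, sketched below under the assumption that you have a PIL image (the file name is a placeholder), is to keep the built-in transform and apply your own torchvision transforms to the images before they reach the model:

import torch
from PIL import Image
from torchvision import transforms as T
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Custom preprocessing / augmentation, applied before the model call
# (e.g. inside your Dataset's __getitem__).
custom_transform = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2),  # example augmentation
    T.ToTensor(),                                 # PIL image -> CHW float tensor in [0, 1]
])

pil_img = Image.open('some_image.jpg')            # placeholder file name
img = custom_transform(pil_img)

with torch.no_grad():
    prediction = model([img])[0]                  # dict with 'boxes', 'labels', 'scores'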

Hope this helps.

Upvotes: 3
