Reputation: 2385
I am trying to train a Faster R-CNN network for object detection on a custom dataset of images. However, I don't want to feed an RGB image directly into the network; I first pass it, together with the corresponding thermal image, through another network (a feature extractor). The feature extractor concatenates the two images into a 4-channel tensor and outputs a 5-channel tensor, and it is this 5-channel tensor that I want to give as input to the Faster R-CNN network.
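For concreteness, here is a minimal stand-in with the same shape contract as my extractor (the real extractor is my own network; the single conv layer here is purely hypothetical):

import torch
import torch.nn as nn

class FusionFeatureExtractor(nn.Module):
    """Hypothetical placeholder: maps a 4-channel (RGB + thermal)
    input to a 5-channel output, like my real feature extractor."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels=4, out_channels=5,
                              kernel_size=3, padding=1)

    def forward(self, x):               # x: [1, 4, H, W]
        return self.fuse(x).squeeze(0)  # [5, H, W], as my real extractor returns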
I followed the PyTorch docs for Object Detection Finetuning (link here) and came up with the following code to suit my dataset.
import os
import xml.etree.ElementTree as ET

import numpy as np
import torch
import torchvision.transforms.functional as TF
from PIL import Image

class CustomDataset(torch.utils.data.Dataset):
    def __getitem__(self, idx):
        self.num_classes = 5

        img_rgb_path = os.path.join(self.root, "rgb/", self.rgb_imgs[idx])
        img_thermal_path = os.path.join(self.root, "thermal/", self.thermal_imgs[idx])

        img_rgb = Image.open(img_rgb_path)
        img_rgb = np.array(img_rgb)
        x_rgb = TF.to_tensor(img_rgb)
        x_rgb.unsqueeze_(0)   # shape [1, 3, 640, 512]

        img_thermal = Image.open(img_thermal_path)
        img_thermal = np.array(img_thermal)
        img_thermal = np.expand_dims(img_thermal, -1)
        x_th = TF.to_tensor(img_thermal)
        x_th.unsqueeze_(0)    # shape [1, 1, 640, 512]

        fused = torch.cat((x_rgb, x_th), dim=1)  # shape [1, 4, 640, 512]
        # My custom feature extractor, which returns a 5-channel tensor
        # of shape [5, 640, 512]
        img = self.feature_extractor(fused)

        filename = os.path.join(self.root, 'annotations', self.annotations[idx])
        tree = ET.parse(filename)
        objs = tree.findall('object')
        num_objs = len(objs)

        boxes = []
        labels = np.zeros((num_objs,), dtype=np.int64)  # torchvision expects int64 class labels
        seg_areas = np.zeros((num_objs,), dtype=np.float32)

        for ix, obj in enumerate(objs):
            bbox = obj.find('bndbox')
            x1 = float(bbox.find('xmin').text)
            y1 = float(bbox.find('ymin').text)
            x2 = float(bbox.find('xmax').text)
            y2 = float(bbox.find('ymax').text)
            cls = self._class_to_ind[obj.find('name').text.lower().strip()]
            boxes.append([x1, y1, x2, y2])
            labels[ix] = cls
            seg_areas[ix] = (x2 - x1 + 1) * (y2 - y1 + 1)

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        seg_areas = torch.as_tensor(seg_areas, dtype=torch.float32)
        labels = torch.as_tensor(labels, dtype=torch.int64)

        target = {'boxes': boxes,
                  'labels': labels,
                  'seg_areas': seg_areas,
                  }

        return img, target
My main function code is as follows:
import time

import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

import utils

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def train_model(model, criterion, dataloader, num_epochs):
    since = time.time()
    best_model = model
    best_acc = 0.0

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        model.train()  # set model to training mode
        running_loss = 0.0
        running_corrects = 0

        for data in dataloader:
            inputs, labels = data[0][0], data[1]
            inputs = inputs.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward
            outputs = model(inputs, labels)
            _, preds = torch.max(outputs.data, 1)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            running_corrects += torch.sum(preds == labels).item()

        epoch_loss = running_loss / len(dataloader)
        epoch_acc = running_corrects / len(dataloader)

        print('train Loss: {:.4f} Acc: {:.4f}'.format(epoch_loss, epoch_acc))
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
backbone.out_channels = 1280

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)

num_classes = 5
model = FasterRCNN(backbone=backbone,
                   num_classes=num_classes,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)

dataset = CustomDataset('train_folder/')
data_loader_train = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=True,
                                                collate_fn=utils.collate_fn)

train_model(model, criterion, data_loader_train, num_epochs=10)
The collate_fn defined in the utils.py file is the following:
def collate_fn(batch):
    return tuple(zip(*batch))
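For reference, this just transposes the batch: a list of (image, target) pairs becomes one tuple of images and one tuple of targets. A tiny self-contained illustration with made-up dummy tensors:

import torch

def collate_fn(batch):
    return tuple(zip(*batch))

# Two dummy (image, target) samples, shaped like __getitem__'s output.
sample_1 = (torch.rand(5, 640, 512), {'labels': torch.tensor([1])})
sample_2 = (torch.rand(5, 640, 512), {'labels': torch.tensor([2])})

images, targets = collate_fn([sample_1, sample_2])
print(len(images), len(targets))  # 2 2 -- a tuple of images and a tuple of targets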
However, I get the following error while training:
Traceback (most recent call last):
  File "train.py", line 147, in <module>
    train_model(model, criterion, data_loader_train, num_epochs)
  File "train.py", line 58, in train_model
    outputs = model(inputs, labels)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/generalized_rcnn.py", line 66, in forward
    images, targets = self.transform(images, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/transform.py", line 46, in forward
    image = self.normalize(image)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/transform.py", line 66, in normalize
    return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (5) must match the size of tensor b (3) at non-singleton dimension 0
I am a newbie in PyTorch.
Upvotes: 0
Views: 6762
Reputation: 1518
The backbone network you are using for the FasterRCNN is a pretrained mobilenet_v2.
The number of input channels a network accepts is fixed by the data it was trained on. Since the backbone model is pretrained (presumably on natural images) with 3-channel inputs of shape 3xNxM, you cannot use it on tensors of shape 5xPxQ (skipping the singleton batch_size dimension).
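You can see the 3-channel assumption baked directly into the pretrained weights; for example, the first convolution of mobilenet_v2 (layer indices here assume torchvision's current mobilenet_v2 layout, so check your version):

import torchvision

backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# First conv: weight shape [out_channels, in_channels, kH, kW],
# i.e. 3 input channels are fixed by the pretrained weights.
print(backbone[0][0].weight.shape)  # torch.Size([32, 3, 3, 3])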
Basically, you have 2 options:
1. Reduce the output channel dimension of the first network to 3 (better if you are training it from scratch).
2. Make a new backbone for the FasterRCNN with 5 input channels and train it from scratch (a sketch of this option follows below).
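A minimal sketch of option 2, assuming torchvision's mobilenet_v2 layout; the replaced first conv and the placeholder normalization statistics are my own choices, not tested code:

import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# Untrained mobilenet_v2 whose first conv takes 5 channels instead of 3.
backbone = torchvision.models.mobilenet_v2(pretrained=False).features
backbone[0][0] = torch.nn.Conv2d(5, 32, kernel_size=3, stride=2,
                                 padding=1, bias=False)
backbone.out_channels = 1280

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],  # ['0'] on newer torchvision
                                                output_size=7,
                                                sampling_ratio=2)

# image_mean/image_std must also have 5 entries each, otherwise the model's
# internal normalization hits the exact broadcast error from your traceback.
model = FasterRCNN(backbone,
                   num_classes=5,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler,
                   image_mean=[0.0] * 5,  # placeholder stats; compute them from your data
                   image_std=[1.0] * 5)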
As for the error message itself: in
return (image - mean[:, None, None]) / std[:, None, None]
PyTorch is trying to normalize the input image, but your image has shape (5, M, N) while the tensors mean and std only have 3 channels instead of 5.
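You can reproduce that failing line in isolation; torchvision's default normalization statistics (the usual ImageNet values) have 3 entries, so broadcasting against a 5-channel image fails:

import torch

image = torch.rand(5, 640, 512)             # your 5-channel input
mean = torch.tensor([0.485, 0.456, 0.406])  # torchvision's 3-channel defaults
std = torch.tensor([0.229, 0.224, 0.225])

# Same line as in transform.py: (5, 640, 512) vs (3, 1, 1) cannot broadcast.
(image - mean[:, None, None]) / std[:, None, None]
# RuntimeError: The size of tensor a (5) must match the size of tensor b (3) ...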
Upvotes: 1