shoab ahamed

Reputation: 1

MaskRCNN in PyTorch gives a different loss every time

I am doing instance segmentation with maskrcnn_resnet50_fpn_v2. I have only one class in the training dataset, but each image can have up to 1000 instances of that single class. I am running all my code in a Kaggle notebook. I set the seed using this function:

import os
import random

import numpy as np
import torch

def seed_everything(seed):
    # Seed every RNG in play: Python, NumPy, and PyTorch (CPU and all GPUs).
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

The loss is perfectly reproducible on the CPU, but on the GPU the loss is different every time I run the training loop, although the first epoch's loss is always the same. I am training on a single image just to check that everything works and is reproducible. The model:

from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def get_model():
    seed_everything(CFG.seed)
    model = maskrcnn_resnet50_fpn_v2(weights='DEFAULT')
    seed_everything(CFG.seed)

    class_names = ['field', 'background']
    # Get the number of input features for the classifier
    in_features_box = model.roi_heads.box_predictor.cls_score.in_features
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    # Get the number of output channels for the mask predictor
    dim_reduced = model.roi_heads.mask_predictor.conv5_mask.out_channels
    # Replace the box predictor
    model.roi_heads.box_predictor = FastRCNNPredictor(in_channels=in_features_box, num_classes=len(class_names))
    # Replace the mask predictor
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels=in_features_mask, dim_reduced=dim_reduced, num_classes=len(class_names))

    return model.to(CFG.device)

The training loop

from tqdm import tqdm

model = get_model()
optimizer = torch.optim.Adam(model.parameters())
no_epochs = 10
epoch_loss = []
model.train()
for epoch_no in tqdm(range(1, no_epochs + 1)):
    optimizer.zero_grad()

    # In training mode the model returns a dict of losses; sum them for backprop.
    losses = model(input_batch, targets)
    loss = sum(losses.values())
    loss.backward()
    optimizer.step()
    print(f"Epoch no: {epoch_no} | loss: {loss.item():.4f}")
    epoch_loss.append(loss.item())
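
For reference, torchvision's Mask R-CNN in training mode expects a list of image tensors and a list of per-image target dicts; a minimal sketch of how input_batch and targets for the single training image might look is below. Only the dict keys ("boxes", "labels", "masks") come from the torchvision API; the image size and the single dummy instance are assumptions, not my actual data.

image = torch.rand(3, 512, 512, device=CFG.device)  # one image tensor, C x H x W, values in [0, 1]
target = {
    "boxes": torch.tensor([[10., 20., 100., 150.]], device=CFG.device),      # one box in xyxy format
    "labels": torch.ones(1, dtype=torch.int64, device=CFG.device),           # class id of that instance
    "masks": torch.zeros(1, 512, 512, dtype=torch.uint8, device=CFG.device), # one binary mask per instance
}
input_batch, targets = [image], [target]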

After running the training loop the first time, the losses look like this:

Epoch no: 1 | loss: 6.6777
Epoch no: 2 | loss: 8.2722
Epoch no: 3 | loss: 3.3449
Epoch no: 4 | loss: 2.6268
Epoch no: 5 | loss: 2.5746
Epoch no: 6 | loss: 2.2247
Epoch no: 7 | loss: 2.2633
Epoch no: 8 | loss: 2.3275
Epoch no: 9 | loss: 2.1480
Epoch no: 10 | loss: 2.0851

If I run the training loop again, the losses are:

Epoch no: 1 | loss: 6.6777
Epoch no: 2 | loss: 8.2743
Epoch no: 3 | loss: 3.4022
Epoch no: 4 | loss: 2.6555
Epoch no: 5 | loss: 2.6238
Epoch no: 6 | loss: 2.2365
Epoch no: 7 | loss: 2.1556
Epoch no: 8 | loss: 2.1003
Epoch no: 9 | loss: 2.1026
Epoch no: 10 | loss: 2.0324

The first epoch's loss is always the same, but from the second epoch onward the losses start to differ.

Secondly, if I create a new model with this code:

model = get_model()

and then run the code below twice:

losses = model(input_batch, targets)
loss = sum(losses.values())
print(loss.item())

it gives two different loss values:

First time loss:  6.677738666534424
Second time loss:  7.194179534912109

I do not understand how this is possible: since I did not change any weights, the model should not have changed.
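
To rule out an accidental weight change, a quick sketch like the following can snapshot the parameters and compare them after the two forward passes (the snapshot/compare code here is added for illustration, it is not part of the original notebook):

snapshot = {name: p.detach().clone() for name, p in model.named_parameters()}
loss1 = sum(model(input_batch, targets).values())
loss2 = sum(model(input_batch, targets).values())
# No optimizer.step() was called, so every parameter should be bit-for-bit identical.
unchanged = all(torch.equal(snapshot[name], p.detach()) for name, p in model.named_parameters())
print(loss1.item(), loss2.item(), "parameters unchanged:", unchanged)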

I have tried using torch.use_deterministic_algorithms(True), but if I do this my GPU runs out of memory. I have seen some topics about this issue, but none of them has worked for me. What should I do? I want the model to be reproducible so that I can tune the hyperparameters and choose the best ones. My dataset only has 50 training images.
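
For reference, the determinism setup I mean looks roughly like this; the cuBLAS workspace variable and the warn_only flag are the standard companions described in the PyTorch reproducibility docs. I am only sketching them here, not claiming they fix the memory issue.

import os
import torch

# cuBLAS needs this workspace setting (set before the first CUDA call) to make
# its matrix multiplies deterministic.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Error out on any op that has no deterministic implementation...
torch.use_deterministic_algorithms(True)
# ...or, in newer PyTorch versions, only warn instead of raising:
# torch.use_deterministic_algorithms(True, warn_only=True)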

Upvotes: 0

Views: 40

Answers (1)

Fotisk

Reputation: 11

This is expected behavior; as of right now, it is not entirely possible to run perfectly deterministic operations on the GPU.

See this answer for more information: How to handle non-determinism when training on a GPU?

Basically, some operations on the GPU are non-deterministic because of multithreading.
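
The root of it is that floating-point addition is not associative, so when a parallel kernel accumulates partial results in a different order from one run to the next, the result changes in the last few bits and the differences grow over training. A tiny sketch of just the floating-point effect (this demonstrates the ordering issue itself, not the GPU kernels):

import torch

# The same numbers summed in a different order generally give a slightly
# different float32 result, which is all it takes for training runs to diverge.
x = torch.randn(1_000_000)
print(x.sum().item())
print(x[torch.randperm(x.numel())].sum().item())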

Your losses are still quite similar, so it should suffice for your hyperparameter tuning.
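
If the remaining run-to-run noise still worries you, a common workaround is to repeat each configuration over a few seeds and compare the averaged result. A sketch under that assumption (train_one_config is a hypothetical helper that runs your training loop once and returns the final loss):

import statistics

def evaluate_config(config, seeds=(0, 1, 2)):
    # Average the final loss over a few seeds so GPU noise does not decide the comparison.
    finals = []
    for seed in seeds:
        seed_everything(seed)                     # the seeding function from the question
        finals.append(train_one_config(config))   # hypothetical: trains once, returns final loss
    return statistics.mean(finals)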

Upvotes: 0
