Reputation: 1
I am doing instance segmentation using maskrcnn_resnet50_fpn_v2. I have only one class in the training dataset, but each image can have up to 1000 instances of that single class. I am running all my code in a Kaggle notebook. I set the seed with this function:
import os
import random

import numpy as np
import torch

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force cuDNN to pick deterministic kernels and disable autotuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
The loss is perfectly reproducible if I use the CPU, but on the GPU the loss is different every time I run the training loop, although the first epoch's loss is always the same. I am training on a single image just to check that everything works and is reproducible. The model:
def get_model():
    seed_everything(CFG.seed)
    model = maskrcnn_resnet50_fpn_v2(weights='DEFAULT')
    seed_everything(CFG.seed)
    class_names = ['field', 'background']
    # Get the number of input features for the classifier
    in_features_box = model.roi_heads.box_predictor.cls_score.in_features
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    # Get the number of output channels for the mask predictor
    dim_reduced = model.roi_heads.mask_predictor.conv5_mask.out_channels
    # Replace the box predictor
    model.roi_heads.box_predictor = FastRCNNPredictor(in_channels=in_features_box, num_classes=len(class_names))
    # Replace the mask predictor
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels=in_features_mask, dim_reduced=dim_reduced, num_classes=len(class_names))
    return model.to(CFG.device)
The training loop:
model = get_model()
optimizer = torch.optim.Adam(model.parameters())
no_epochs = 10
epoch_loss = []
model.train()
for epoch_no in tqdm(range(1, no_epochs + 1)):
    optimizer.zero_grad()
    losses = model(input_batch, targets)
    loss = sum(losses.values())
    loss.backward()
    optimizer.step()
    print(f"Epoch no: {epoch_no} | loss: {loss.item():.4f}")
    epoch_loss.append(loss.item())
After running the training loop for the first time, the losses look like this:
Epoch no: 1 | loss: 6.6777
Epoch no: 2 | loss: 8.2722
Epoch no: 3 | loss: 3.3449
Epoch no: 4 | loss: 2.6268
Epoch no: 5 | loss: 2.5746
Epoch no: 6 | loss: 2.2247
Epoch no: 7 | loss: 2.2633
Epoch no: 8 | loss: 2.3275
Epoch no: 9 | loss: 2.1480
Epoch no: 10 | loss: 2.0851
If I run the training loop again, the losses are:
Epoch no: 1 | loss: 6.6777
Epoch no: 2 | loss: 8.2743
Epoch no: 3 | loss: 3.4022
Epoch no: 4 | loss: 2.6555
Epoch no: 5 | loss: 2.6238
Epoch no: 6 | loss: 2.2365
Epoch no: 7 | loss: 2.1556
Epoch no: 8 | loss: 2.1003
Epoch no: 9 | loss: 2.1026
Epoch no: 10 | loss: 2.0324
The first epoch's loss is always the same, but from the second epoch onward the losses start to diverge.
Secondly, if I create a new model with this code:
model = get_model()
and then run the code below twice:
losses = model(input_batch, targets)
loss = sum(losses.values())
print(loss.item())
it gives two different loss values:
First time loss: 6.677738666534424
Second time loss: 7.194179534912109
I do not understand how this is possible: since I did not change any weights, the model should not have changed.
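One way to check whether the global RNG state, rather than the weights, is responsible (a minimal sketch, assuming torchvision's detection heads sample proposals randomly during the training forward pass, so every call advances the RNG) is to re-seed before each call:
# If proposal sampling consumes random numbers, re-seeding should make
# the two losses match (exactly on CPU, approximately on GPU)
seed_everything(CFG.seed)
loss_a = sum(model(input_batch, targets).values())

seed_everything(CFG.seed)
loss_b = sum(model(input_batch, targets).values())

print(loss_a.item(), loss_b.item())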
I have tried using torch.use_deterministic_algorithms(True), but if I do, my GPU runs out of memory. I have seen some topics about this issue, but none of them worked for me. What should I do? I want the model to be reproducible so that I can tune the hyperparameters and choose the best ones. My dataset only has 50 training images.
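For reference, this is the deterministic setup I tried (a sketch based on the PyTorch reproducibility notes; CUBLAS_WORKSPACE_CONFIG must be set for deterministic cuBLAS operations on CUDA >= 10.2, and warn_only=True, available in newer PyTorch versions, downgrades the error for ops without a deterministic implementation to a warning):
import os
import torch

# Must be set before the first cuBLAS call to get deterministic GEMMs
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
# warn_only=True warns instead of raising for ops that have no
# deterministic implementation, so training can still run
torch.use_deterministic_algorithms(True, warn_only=True)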
Upvotes: 0
Views: 40
Reputation: 11
This is expected behavior; as of right now, it is not entirely possible to run perfectly deterministic operations on the GPU.
See this answer for more information: How to handle non-determinism when training on a GPU?
Basically, some operations on the GPU are non-deterministic because of multi-threading: floating-point addition is not associative, so the order in which parallel threads accumulate partial results changes the low-order bits of the result.
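As a quick illustration (a minimal sketch, assuming a CUDA device is available), scatter-style reductions such as index_add_ use floating-point atomics, so the accumulation order depends on thread scheduling:
import torch

# index_add_ on CUDA accumulates with atomic adds; the order of the
# additions varies between runs, so the low-order bits can differ
src = torch.randn(1_000_000, device='cuda')
idx = torch.randint(0, 10, (1_000_000,), device='cuda')

a = torch.zeros(10, device='cuda').index_add_(0, idx, src)
b = torch.zeros(10, device='cuda').index_add_(0, idx, src)
print(torch.equal(a, b))  # frequently False on the GPU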
Your losses are still quite similar; that should suffice for your hyper-parameter tuning.
Upvotes: 0