nn.DataParallel - Training doesn't seem to start

Question

I am having a lot of problems using nn.DistributedDataParallel, because I cannot find a good working example of how to specify GPU IDs within a single node. For this reason, I want to start with nn.DataParallel, since it should be easier to implement. According to the documentation [https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html], the following should work:

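import torch

# The model's parameters must live on device_ids[0] (here GPU 1) before wrapping
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
model = Model(arg).to(device)
model = torch.nn.DataParallel(model, device_ids=[1, 8, 9])
for step, (original, keypoints) in enumerate(train_loader):
    original, keypoints = original.to(device), keypoints.to(device)
    # Assumes Model returns a loss; DataParallel gathers one value per GPU, so reduce to a scalar
    loss = model(original).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()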

However, when I start the process, the model is distributed to all three GPUs, but the training doesn't start. The GPU memory remains almost empty (except for the memory used for loading the model). This can be seen here (see GPUs 1, 8, and 9):

Can someone explain to me why that's not working?

Thanks a lot!!
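
One way to narrow down a loop that appears stuck is to run a tiny, self-contained version that prints after every step. The following is only a sketch under stated assumptions: TinyModel, train, the layer sizes, and the random data are hypothetical stand-ins (none of them come from the question), and the model is wrapped in nn.DataParallel only when more than one GPU is selected, so the same script also runs on a single GPU or on the CPU.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the question's Model; returns predictions, not a loss."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 2)

    def forward(self, x):
        return self.net(x)


def train(steps=5, batch_size=12, device_ids=None):
    # Default to every visible GPU; an empty list means "run on the CPU".
    if device_ids is None:
        device_ids = list(range(torch.cuda.device_count()))
    # DataParallel expects parameters and inputs on device_ids[0].
    device = torch.device(f"cuda:{device_ids[0]}" if device_ids else "cpu")

    torch.manual_seed(0)
    model = TinyModel().to(device)
    if len(device_ids) > 1:
        model = nn.DataParallel(model, device_ids=device_ids)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.MSELoss()

    losses = []
    for step in range(steps):
        # Random data stands in for train_loader.
        inputs = torch.randn(batch_size, 8, device=device)
        targets = torch.zeros(batch_size, 2, device=device)

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)  # scalar loss computed on `device`
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        # Flushed per-step output shows whether the loop is actually advancing.
        print(f"step {step}: loss {loss.item():.4f}", flush=True)
    return losses


if __name__ == "__main__":
    train()
```

On the machine from the question, this could be called as train(device_ids=[1, 8, 9]). If the per-step lines appear here but not in the real script, the stall is more likely in the data loading or in the model itself than in nn.DataParallel.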
