spadel

Reputation: 1036

nn.DataParallel - Training doesn't seem to start

I am having a lot of problems using nn.DistributedDataParallel, because I cannot find a good working example of how to specify GPU IDs within a single node. For this reason, I want to start with nn.DataParallel, since it should be easier to implement. According to the documentation [https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html], the following should work:

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
model = Model(arg).to(device)
model = torch.nn.DataParallel(model, device_ids=[1, 8, 9])
for step, (original, keypoints) in enumerate(train_loader):
    original, keypoints = original.to(device), keypoints.to(device)
    loss = model(original)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

However, when I start the process, the model is distributed to all three GPUs, but the training doesn't start. The GPU memory remains almost empty (except for the memory used for loading the model). This can be seen here (see GPUs 1, 8, and 9):

[Screenshot: GPU memory usage for GPUs 1, 8, and 9]

Can someone explain to me why this isn't working?

Thanks a lot!!

Upvotes: 2

Views: 4881

Answers (1)

Edwin Cheong

Reputation: 979

I am making a guess here and I haven't tested it since I don't have multiple GPUs.

I think you're supposed to wrap the model in DataParallel first and then move it to the GPU:

model = Model(arg)
model = torch.nn.DataParallel(model, device_ids=[1, 8, 9])
model.to(device)

You can check out the tutorial I referenced here: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
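
For reference, here is a minimal sketch of the wrap-then-move pattern with a full training step. I have not run it on a multi-GPU machine. The names usable_device_ids and run_sketch are hypothetical helpers, and nn.Linear, the random tensors and MSELoss are stand-ins for your Model(arg), train_loader and loss. It falls back to a single device when fewer than two of the requested GPUs exist, and it calls backward() on the same loss tensor that the forward pass produced.

```python
def usable_device_ids(requested, visible_count):
    """Keep only the requested GPU indices that exist on this machine."""
    return [i for i in requested if 0 <= i < visible_count]


def run_sketch(requested=(1, 8, 9), steps=3):
    # Imported here so the helper above can be used without PyTorch installed.
    import torch
    from torch import nn

    ids = usable_device_ids(requested, torch.cuda.device_count())
    # DataParallel keeps the parameters on device_ids[0]; use the CPU if no GPU matches.
    device = torch.device(f"cuda:{ids[0]}" if ids else "cpu")

    model = nn.Linear(16, 4)  # hypothetical stand-in for Model(arg)
    if len(ids) > 1:
        # Wrap first, then move, as in the tutorial linked above.
        model = nn.DataParallel(model, device_ids=ids)
    model.to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()  # assumed loss, only for illustration
    losses = []
    for _ in range(steps):
        # Random batches stand in for the real train_loader.
        inputs = torch.randn(32, 16, device=device)
        targets = torch.randn(32, 4, device=device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()  # backward on the tensor returned by this forward pass
        optimizer.step()
        losses.append(loss.item())
    return losses
```

Calling run_sketch() returns one loss value per step. If that runs but your own script still hangs, the problem is more likely in the data loading or the model's forward pass than in the DataParallel setup.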

Upvotes: 2
