RichCode

Reputation: 21

How to train the PyTorch transfer learning tutorial with more than 1 GPU

I am currently following the PyTorch transfer learning tutorial in: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

I have been able to complete the tutorial and train on both a CPU and a single GPU.

I am utilising a Google Cloud Platform Notebook Instance with 4 NVIDIA Tesla K80 GPUs. It is here that I run into a Server Connection Error (invalid response: 504) when I train the network on more than 1 GPU:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models

model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Using 4 GPUs
if torch.cuda.device_count() > 1:
    model_ft = nn.DataParallel(model_ft)
model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()

optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

# train_model is the training-loop helper defined earlier in the tutorial
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=25)

The idea was to utilise nn.DataParallel so that all 4 available GPUs are used to train the network.
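For reference, here is a minimal check (not part of the tutorial) of how nn.DataParallel behaves: it scatters the input batch along dimension 0 across the visible GPUs and gathers the outputs back on the first device.

import torch
import torch.nn as nn
from torchvision import models

# With 4 GPUs visible, a batch of 32 images is split into 4 chunks of 8,
# one per GPU, and the outputs are gathered back on cuda:0.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
model = nn.DataParallel(model).to("cuda:0")

dummy = torch.randn(32, 3, 224, 224, device="cuda:0")
print(model(dummy).shape)  # torch.Size([32, 2])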

Am I missing something in the implementation? Please advise.

Thanks

Upvotes: 1

Views: 633

Answers (1)

Daniel Schneider

Reputation: 2046

IMO, it is easiest to use Horovod for multi-GPU training. Here is an example of a distributed GPU training script using Horovod: https://github.com/horovod/horovod/blob/master/examples/pytorch_mnist.py

You will need to have OpenMPI installed (likely already on the box), and you will need Horovod installed in the Python environment (pip install horovod -- full install instructions are here: https://github.com/horovod/horovod#install).

Then you would start your job with horovodrun -np 4 python pytorch_mnist.py (here are some docs on how to start a Horovod run: https://horovod.readthedocs.io/en/latest/mpirun.html).

This will enable you to train not only on one node with multiple GPUs, but also across multiple nodes (e.g. across 2 nodes with 4 GPUs each).
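For example, assuming 4 GPUs per node (the host names below are placeholders, not taken from your setup):

# single node, 4 GPUs
horovodrun -np 4 python pytorch_mnist.py

# two nodes with 4 GPUs each
horovodrun -np 8 -H server1:4,server2:4 python pytorch_mnist.py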

The salient points for distributed training with Horovod are:

  • Horovod will start as many processes as you instruct it to, so in your case 4. Each process runs the same script; the only difference is the Horovod/MPI rank. The rank is then used to select the corresponding CUDA device:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)
  • A DistributedSampler is used to divide the data up across the different processes. hvd.rank() ensures that each process uses a different partition of the data, and hvd.size() captures how many processes there are in total.
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
  • Wrap your optimizer with a DistributedOptimizer -- that will take care of aggregating the gradients across the processes at the end of each minibatch:
    # Horovod: wrap optimizer with DistributedOptimizer.
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters(),
                                         compression=compression)
  • Most importantly, you don't have to modify your actual model and training loop, which is pretty neat.

There are a few more interesting things in the sample (e.g. growing the learning rate with the number of processes, broadcasting your parameters at the start).
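Putting those pieces together with the resnet18 fine-tuning setup from your question, a minimal sketch could look like the following (the dataset path and trimmed transforms follow the tutorial's layout and are assumptions, not code from the linked example):

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data.distributed
import horovod.torch as hvd
from torchvision import datasets, models, transforms

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin this process to its GPU

# Same resnet18 fine-tuning head as in the question.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.cuda()

# Dataset as in the tutorial (transforms trimmed for brevity).
train_dataset = datasets.ImageFolder(
    'data/hymenoptera_data/train',
    transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]))

# Each of the hvd.size() processes reads its own partition of the data.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=4, sampler=train_sampler)

criterion = nn.CrossEntropyLoss()
# Scale the learning rate with the number of processes, as in the sample.
optimizer = optim.SGD(model.parameters(), lr=0.001 * hvd.size(), momentum=0.9)

# Aggregate gradients across processes at the end of each minibatch.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every process from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# The training loop itself is unchanged: iterate over train_loader,
# compute criterion(model(inputs), labels), backward(), optimizer.step().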

Upvotes: 0
