Reputation: 21
I am currently following the PyTorch transfer learning tutorial at: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
I have been able to complete the tutorial and train on both a CPU and a single GPU.
I am utilising Google Cloud Platform Notebook Instances with an NVIDIA Tesla K80 x 4 configuration (4 GPUs). It is here that I run into a Server Connection Error (invalid response: 504) when I train the network on more than 1 GPU.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet and replace the final layer for 2 classes
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Using 4 GPUs: wrap the model so each batch is split across all visible GPUs
if torch.cuda.device_count() > 1:
    model_ft = nn.DataParallel(model_ft)
model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=25)
The idea was to use the data parallelism function (nn.DataParallel) so that all available GPUs (4 in this case) are utilised to train the network.
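For reference, nn.DataParallel splits each input batch along its first dimension, runs a replica of the model on every visible GPU, and gathers the outputs back on cuda:0. A minimal sketch of a single training step under that wrapping (the dataloader name and batch shape are assumptions, not the tutorial's exact code):
# Sketch of one step with nn.DataParallel; names and shapes below are assumed.
for inputs, labels in dataloaders['train']:   # assumed dataloader from the tutorial
    inputs = inputs.to(device)                # e.g. [32, 3, 224, 224] is scattered as 4 x [8, 3, 224, 224]
    labels = labels.to(device)
    optimizer_ft.zero_grad()
    outputs = model_ft(inputs)                # replicas run in parallel, outputs gathered on cuda:0
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer_ft.step()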
Am I missing something in the implementation? Please advise.
Thanks
Upvotes: 1
Views: 633
Reputation: 2046
IMO, it is easiest to use Horovod to do multi-GPU training. Here is an example of a distributed GPU training script using Horovod: https://github.com/horovod/horovod/blob/master/examples/pytorch_mnist.py
You will need to have OpenMPI installed (likely already on the box), and you will need to have Horovod installed in the Python environment (pip install horovod -- full install instructions are here: https://github.com/horovod/horovod#install).
Then you would start your job with horovodrun -np 4 python pytorch_mnist.py
(Here are some docs on how to start a Horovod run: https://horovod.readthedocs.io/en/latest/mpirun.html)
This will enable you to train not only on one node with multiple GPUs, but also across multiple nodes (e.g. across 2 nodes with 4 GPUs each).
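For example, an 8-process run across two machines with 4 GPUs each (the hostnames below are placeholders) could be launched with:
horovodrun -np 8 -H server1:4,server2:4 python pytorch_mnist.py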
The salient points for distributed training with Horovod are:
# Horovod: pin GPU to local rank.
torch.cuda.set_device(hvd.local_rank())
torch.cuda.manual_seed(args.seed)
A DistributedSampler is used to divide the data up across the different processes. hvd.rank() is used to make sure a different partition of the data is used by each of the processes, and hvd.size() captures how many processes there are in total:
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
A DistributedOptimizer takes care of aggregating the gradients across the processes at the end of each minibatch:
# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression)
There are a few more interesting things in the sample (e.g. growing the learning rate with the number of processes, broadcasting your parameters at the start); these pieces are pulled together in the sketch below.
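Putting those points together, a minimal self-contained Horovod skeleton might look like the following (the dummy dataset, linear model, and hyperparameters are placeholders of mine, not the MNIST example from the link):
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data.distributed
import horovod.torch as hvd

# Horovod: initialize the library and pin this process to one GPU.
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Dummy data and model so the skeleton runs as-is; in practice these would be
# the tutorial's ImageFolder datasets and the resnet18 from the question.
train_dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
model = nn.Linear(10, 2).cuda()

# Horovod: partition the data so each process sees a different shard.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=32, sampler=train_sampler)

# Horovod: scale the learning rate with the number of processes and wrap the
# optimizer so gradients are averaged across processes each minibatch.
optimizer = optim.SGD(model.parameters(), lr=0.001 * hvd.size(), momentum=0.9)
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Horovod: start every process from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

criterion = nn.CrossEntropyLoss()
for epoch in range(5):
    train_sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for inputs, labels in train_loader:
        inputs, labels = inputs.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()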
Upvotes: 0