RichCode

Reputation: 21

How to train the PyTorch transfer learning tutorial with more than 1 GPU

I am currently following the PyTorch transfer learning tutorial in: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

I have been able to complete the tutorial and train on both a CPU and a single GPU.

I am utilising a Google Cloud Platform Notebook Instance with 4 NVIDIA Tesla K80 GPUs. It is here that I run into a Server Connection Error (invalid response: 504) when I train the network on more than 1 GPU:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models

model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Using 4 GPUs
if torch.cuda.device_count() > 1:
    model_ft = nn.DataParallel(model_ft)
model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()

optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

# train_model is the training-loop helper defined earlier in the tutorial
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=25)

The idea was to utilise nn.DataParallel so that all 4 available GPUs are used to train the network.
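For reference, here is a minimal check (not part of the tutorial) of how nn.DataParallel behaves: it scatters the input batch along dimension 0 across the visible GPUs and gathers the outputs back on the first device.

import torch
import torch.nn as nn
from torchvision import models

# With 4 GPUs visible, a batch of 32 images is split into 4 chunks of 8,
# one per GPU, and the outputs are gathered back on cuda:0.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
model = nn.DataParallel(model).to("cuda:0")

dummy = torch.randn(32, 3, 224, 224, device="cuda:0")
print(model(dummy).shape)  # torch.Size([32, 2])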

Am I missing something in the implementation? Please advise.

Thanks

Upvotes: 1

Views: 633

Answers (1)

Daniel Schneider

Reputation: 2046

IMO, it is easiest to use Horovod for multi-GPU training. Here is an example of a distributed GPU training script using Horovod: https://github.com/horovod/horovod/blob/master/examples/pytorch_mnist.py

You will need to have OpenMPI installed (likely already on the box), and you will need Horovod installed in the Python environment (pip install horovod -- full install instructions are here: https://github.com/horovod/horovod#install).

Then you would start your job with horovodrun -np 4 python pytorch_mnist.py (here are some docs on how to start a Horovod run: https://horovod.readthedocs.io/en/latest/mpirun.html).

This will enable you to train not only on one node with multiple GPUs, but also across multiple nodes (e.g. across 2 nodes with 4 GPUs each).
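For example, assuming 4 GPUs per node (the host names below are placeholders, not taken from your setup):

# single node, 4 GPUs
horovodrun -np 4 python pytorch_mnist.py

# two nodes with 4 GPUs each
horovodrun -np 8 -H server1:4,server2:4 python pytorch_mnist.py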

The salient points for distributed training with Horovod are:

  • Horovod will start as many processes as you instruct it to, so in your case 4. Each process runs the same script; the only difference is the Horovod/MPI rank. The rank is then used to select the corresponding CUDA device:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)
  • A DistributedSampler is used to divide the data up across the different processes. hvd.rank() ensures that each process uses a different partition of the data, and hvd.size() captures how many processes there are in total.
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
  • Wrap your optimizer with a DistributedOptimizer -- that will take care of aggregating the gradients across the processes at the end of each minibatch:
    # Horovod: wrap optimizer with DistributedOptimizer.
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters(),
                                         compression=compression)
  • Most importantly, you don't have to modify your actual model and training loop, which is pretty neat.

There are a few more interesting things in the sample (e.g. growing the learning rate with the number of processes, broadcasting your parameters at the start).
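Putting those pieces together with the resnet18 fine-tuning setup from your question, a minimal sketch could look like the following (the dataset path and trimmed transforms follow the tutorial's layout and are assumptions, not code from the linked example):

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data.distributed
import horovod.torch as hvd
from torchvision import datasets, models, transforms

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin this process to its GPU

# Same resnet18 fine-tuning head as in the question.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.cuda()

# Dataset as in the tutorial (transforms trimmed for brevity).
train_dataset = datasets.ImageFolder(
    'data/hymenoptera_data/train',
    transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]))

# Each of the hvd.size() processes reads its own partition of the data.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=4, sampler=train_sampler)

criterion = nn.CrossEntropyLoss()
# Scale the learning rate with the number of processes, as in the sample.
optimizer = optim.SGD(model.parameters(), lr=0.001 * hvd.size(), momentum=0.9)

# Aggregate gradients across processes at the end of each minibatch.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every process from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# The training loop itself is unchanged: iterate over train_loader,
# compute criterion(model(inputs), labels), backward(), optimizer.step().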

Upvotes: 0
