Matheus Ianzer

Reputation: 465

Using Multiple GPUs outside of training in PyTorch

I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory on a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such calculations across two GPUs. The code follows:

def ac_distance(layer):
    # Accumulate the distance between every pair of kernels in the layer
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total

Here layer is an instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all calculations in ac_distance are done using only one GPU, while training uses both.
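
For completeness, what I tried looks roughly like this (the network below is just a placeholder for my actual model):

import torch
import torch.nn as nn

# Placeholder network standing in for my actual model
model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3), nn.ReLU())
model = nn.DataParallel(model).cuda()   # training uses both GPUs as expected

conv = model.module[0]                  # the underlying nn.Conv2d
# ...but ac_distance(conv) still builds its whole graph on a single GPU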

Upvotes: 1

Views: 1491

Answers (1)

scarecrow

Reputation: 6864

Parallelism while training neural networks can be achieved in two ways:

  1. Data Parallelism - Split a large batch into two halves and run the same set of operations on each half, one half per GPU
  2. Model Parallelism - Split the computations themselves and run them on different GPUs

As you have asked in the question, you would like to split the calculation, which falls into the second category. There are no out-of-the-box ways to achieve model parallelism. PyTorch provides primitives for parallel processing through the torch.distributed package. This tutorial comprehensively goes through the details of the package, and you can cook up an approach to achieve the model parallelism that you need.
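
Even without torch.distributed, purely as a sketch of what a hand-rolled split could look like for the computation in your question: put each half of the outer loop on its own GPU and add the two partial sums at the end. Autograd tracks tensors copied across devices, so the graph stays connected. The distance below is only a stand-in for yours, and I'm assuming the layer starts on cuda:0.

import torch
import torch.nn as nn

def distance(p, q):
    # Stand-in for your distance; the exact metric doesn't matter here
    return (p - q).abs().sum()

def ac_distance_split(layer):
    # Half of the outer loop (and its autograd graph) lives on each GPU
    weight = layer.weight
    half = weight.size(0) // 2

    w0 = weight.to("cuda:0")
    total_0 = torch.zeros((), device="cuda:0")
    for p in w0[:half]:
        for q in w0:
            total_0 = total_0 + distance(p, q)

    w1 = weight.to("cuda:1")
    total_1 = torch.zeros((), device="cuda:1")
    for p in w1[half:]:
        for q in w1:
            total_1 = total_1 + distance(p, q)

    # The cross-device copy is tracked by autograd, so the graph stays intact
    return total_0 + total_1.to("cuda:0")

layer = nn.Conv2d(64, 64, kernel_size=3).to("cuda:0")
penalty = ac_distance_split(layer)
penalty.backward()
print(layer.weight.grad.shape)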

However, model parallelism can be very complex to achieve. The general way is to do data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods you would run the same model on two different GPUs, but one huge batch would be split into two smaller chunks. In DataParallel, the gradients are accumulated on a single GPU and the optimization happens there; in DistributedDataParallel, optimization takes place in parallel across the GPUs, with one process per GPU via multiprocessing.
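
For reference, a minimal DataParallel setup looks roughly like this (the layer, batch and objective are just placeholders):

import torch
import torch.nn as nn

# Minimal data-parallel sketch: the same module is replicated on GPUs 0 and 1,
# and each replica processes half of the batch.
model = nn.Conv2d(3, 16, kernel_size=3).to("cuda:0")
model = nn.DataParallel(model, device_ids=[0, 1])

inputs = torch.randn(64, 3, 32, 32, device="cuda:0")  # one large batch
outputs = model(inputs)   # 32 samples are processed on each GPU
loss = outputs.sum()      # placeholder objective
loss.backward()           # gradients are gathered back on cuda:0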

In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try using DistributedDataParallel, which is the fastest way to train on multiple GPUs according to the docs.
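
A bare-bones DistributedDataParallel script for two GPUs on one machine would look something like the following (again, the module, data and objective are placeholders):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU; DDP synchronizes gradients with an all-reduce
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}")
    model = DDP(nn.Conv2d(3, 16, kernel_size=3).to(device), device_ids=[rank])

    inputs = torch.randn(32, 3, 32, 32, device=device)  # this rank's shard of the batch
    loss = model(inputs).sum()                           # placeholder objective
    loss.backward()                                      # gradients are all-reduced across ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)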

There are other ways to process very large batches too. This article goes through them in detail and I'm sure it would be helpful. A few important points:

  • Do gradient accumulation for larger batches (see the sketch after this list)
  • Use DataParallel
  • If that doesn't suffice, go with DistributedDataParallel
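
A rough sketch of the gradient-accumulation idea, with placeholder model, data and objective (here an effective batch of 64 is built from 4 micro-batches of 16):

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4

optimizer.zero_grad()
for step in range(accumulation_steps):
    micro_batch = torch.randn(16, 3, 32, 32, device="cuda")  # placeholder data
    loss = model(micro_batch).sum() / accumulation_steps     # scale so gradients match the big batch
    loss.backward()                                          # gradients accumulate in .grad
optimizer.step()                                             # single optimizer step for the whole batch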

Upvotes: 4
