f.k

Reputation: 85

Torch: NCCL available but not used (?)

I'm using PyTorch 1.9.0, but I get the following error when trying to run a distributed version of a model:

File "/home/ferdiko/fastmoe/examples/transformer-xl/train.py", line 315, in <module>
    para_model = DistributedGroupedDataParallel(model).to(device)
  File "/home/ferdiko/anaconda3/envs/fastmoe/lib/python3.9/site-packages/fastmoe-0.2.1-py3.9-linux-x86_64.egg/fmoe/distributed.py", line 45, in __init__
    self.comms["dp"] = get_torch_default_comm()
  File "/home/ferdiko/anaconda3/envs/fastmoe/lib/python3.9/site-packages/fastmoe-0.2.1-py3.9-linux-x86_64.egg/fmoe/utils.py", line 30, in get_torch_default_comm
    raise RuntimeError("Unsupported PyTorch version")

If I run torch.cuda.nccl.version() I get 2708. The developers suggested running:

import torch

x = torch.rand(10).cuda()
print(torch.cuda.nccl.is_available(x))

which gives me False. Does this actually mean that there's a problem with PyTorch and NCCL?

Upvotes: 0

Views: 1268

Answers (1)

MWB

Reputation: 12567

torch.cuda.nccl.is_available takes a sequence of tensors, and if they are on different devices, there is hope that you'll get True:

    In [1]: import torch
    
    In [2]: x = torch.rand(1024, 1024, device='cuda:0')
    
    In [3]: y = torch.rand(1024, 1024, device='cuda:1')
    
    In [4]: torch.cuda.nccl.is_available([x, y])
    Out[4]: True

If you give it just one tensor, torch.cuda.nccl.is_available will iterate over it instead, but different parts of the same tensor are always on the same device, so you'll always get False:

    In [5]: torch.cuda.nccl.is_available(x)
    Out[5]: False

    In [6]: torch.cuda.nccl.is_available([x])
    Out[6]: True
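
As a side note, if the goal is just to confirm that the PyTorch build itself ships with NCCL support (rather than to test a particular set of tensors), a device-independent check along these lines should also work; both calls are part of the standard PyTorch API, though the exact output format can vary between versions:

    import torch
    import torch.distributed as dist

    # True if this PyTorch build was compiled with the NCCL backend,
    # regardless of which devices any particular tensors live on.
    print(dist.is_nccl_available())

    # NCCL version the build links against (an integer such as 2708 on
    # PyTorch 1.9; newer releases may return a tuple instead).
    print(torch.cuda.nccl.version())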

Upvotes: 3
