I'm using dist.all_gather_object (PyTorch version 1.8) to collect sample ids from all GPUs:
for batch in dataloader:
    video_sns = batch["video_ids"]
    logits = model(batch)

    group_gather_vdnames = [None for _ in range(envs['nGPU'])]
    group_gather_logits = [torch.zeros_like(logits) for _ in range(envs['nGPU'])]

    dist.all_gather(group_gather_logits, logits)
    dist.all_gather_object(group_gather_vdnames, video_sns)
The line dist.all_gather(group_gather_logits, logits) works properly, but the program hangs at dist.all_gather_object(group_gather_vdnames, video_sns).
Why does the program hang at dist.all_gather_object(), and how can I fix it?
EXTRA INFO: I run my DDP code on a single local machine with multiple GPUs. The launch script is:
export NUM_NODES=1
export NUM_GPUS_PER_NODE=2
export NODE_RANK=0
export WORLD_SIZE=$(($NUM_NODES * $NUM_GPUS_PER_NODE))
python -m torch.distributed.launch \
--nproc_per_node=$NUM_GPUS_PER_NODE \
--nnodes=$NUM_NODES \
--node_rank $NODE_RANK \
main.py \
--my_args
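For reference, with torch.distributed.launch in PyTorch 1.8 (run without --use_env), each worker process receives its local rank as a --local_rank command-line argument and sets up the process group itself; the launcher supplies RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT through environment variables. A minimal sketch of that per-process setup (the argument handling shown here is illustrative, not taken from main.py):

# Sketch of the per-process setup implied by the launch script above.
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args, _ = parser.parse_known_args()

# Rank and world size are read from the env vars set by the launcher.
dist.init_process_group(backend="nccl", init_method="env://")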
Upvotes: 3
Views: 2282
It turns out we need to set the device manually, as mentioned in the docstring of the dist.all_gather_object() API. Adding

torch.cuda.set_device(envs['LRANK'])  # my local GPU id

before the collective call makes the code work. I had always assumed the GPU device was set automatically by PyTorch's distributed package; it turns out it is not.
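For context, the dist.all_gather_object() documentation notes that with an NCCL process group the Python objects are serialized into tensors placed on the current CUDA device, and it is the caller's responsibility to set that device per rank via torch.cuda.set_device(); if every rank stays on the default device 0, the collective can deadlock. A minimal sketch of the fix applied to the loop from the question (envs['LRANK'] and envs['nGPU'] are the same placeholders used there):

import torch
import torch.distributed as dist

# Bind this process to its own GPU before any object-based collective.
torch.cuda.set_device(envs['LRANK'])

for batch in dataloader:
    video_sns = batch["video_ids"]
    logits = model(batch)

    group_gather_vdnames = [None for _ in range(envs['nGPU'])]
    group_gather_logits = [torch.zeros_like(logits) for _ in range(envs['nGPU'])]

    dist.all_gather(group_gather_logits, logits)              # tensor collective
    dist.all_gather_object(group_gather_vdnames, video_sns)   # object collective, now uses the correct device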
Upvotes: 2