I'm using dist.all_gather_object (PyTorch version 1.8) to collect sample ids from all GPUs:
for batch in dataloader:
    video_sns = batch["video_ids"]
    logits = model(batch)

    group_gather_vdnames = [None for _ in range(envs['nGPU'])]
    group_gather_logits = [torch.zeros_like(logits) for _ in range(envs['nGPU'])]

    dist.all_gather(group_gather_logits, logits)
    dist.all_gather_object(group_gather_vdnames, video_sns)
The line dist.all_gather(group_gather_logits, logits) works properly, but the program hangs at dist.all_gather_object(group_gather_vdnames, video_sns).
Why does the program hang at dist.all_gather_object(), and how can I fix it?
EXTRA INFO: I run my DDP code on a single local machine with multiple GPUs. The launch script is:
export NUM_NODES=1
export NUM_GPUS_PER_NODE=2
export NODE_RANK=0
export WORLD_SIZE=$(($NUM_NODES * $NUM_GPUS_PER_NODE))
python -m torch.distributed.launch \
--nproc_per_node=$NUM_GPUS_PER_NODE \
--nnodes=$NUM_NODES \
--node_rank $NODE_RANK \
main.py \
--my_args
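For reference, with torch.distributed.launch in PyTorch 1.8 (run without --use_env), each worker process receives its local rank as a --local_rank command-line argument and sets up the process group itself; the launcher supplies RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT through environment variables. A minimal sketch of that per-process setup (the argument handling shown here is illustrative, not taken from main.py):

# Sketch of the per-process setup implied by the launch script above.
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args, _ = parser.parse_known_args()

# Rank and world size are read from the env vars set by the launcher.
dist.init_process_group(backend="nccl", init_method="env://")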
Upvotes: 3
Views: 2282
It turns out we need to set the device manually, as mentioned in the docstring of the dist.all_gather_object() API. Adding

torch.cuda.set_device(envs['LRANK'])  # my local GPU id

before the collective call makes the code work. I had always assumed the GPU device was set automatically by PyTorch's distributed package; it turns out it is not.
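For context, the dist.all_gather_object() documentation notes that with an NCCL process group the Python objects are serialized into tensors placed on the current CUDA device, and it is the caller's responsibility to set that device per rank via torch.cuda.set_device(); if every rank stays on the default device 0, the collective can deadlock. A minimal sketch of the fix applied to the loop from the question (envs['LRANK'] and envs['nGPU'] are the same placeholders used there):

import torch
import torch.distributed as dist

# Bind this process to its own GPU before any object-based collective.
torch.cuda.set_device(envs['LRANK'])

for batch in dataloader:
    video_sns = batch["video_ids"]
    logits = model(batch)

    group_gather_vdnames = [None for _ in range(envs['nGPU'])]
    group_gather_logits = [torch.zeros_like(logits) for _ in range(envs['nGPU'])]

    dist.all_gather(group_gather_logits, logits)              # tensor collective
    dist.all_gather_object(group_gather_vdnames, video_sns)   # object collective, now uses the correct device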
Upvotes: 2