wynne yin

Reputation: 1

I want to use the distributed package in PyTorch for point-to-point communication between two ranks, but it fails with an error at runtime.


import torch
import torch.distributed as dist
from torch.distributed import send, recv
from torch.multiprocessing import spawn


def runTpoly(rank, size, pp, cs, pkArithmetics_evals,
             pkSelectors_evals, domain):
    init_process(rank, size)
    # every rank has to take part in creating the group, even non-members
    group2 = dist.new_group([1, 2])
    if rank == 0:
        device = torch.device(f"cuda:{rank}")
        wo_eval_8n = torch.ones(SCALE * 8 * 1, 4, dtype=torch.int64, device='cuda')

    if rank == 1:
        wo_eval_8n = torch.ones(SCALE * 8 * 10, 4, dtype=torch.int64, device='cuda')
        wo_eval_8n = wo_eval_8n + wo_eval_8n
        send(wo_eval_8n, 2)        # point-to-point send to rank 2
    if rank == 2:
        wo_eval_8n = torch.ones(SCALE * 8 * 10, 4, dtype=torch.int64, device='cuda')
        print(wo_eval_8n.size())
        recv(wo_eval_8n, 1)        # point-to-point receive from rank 1
        print(wo_eval_8n)
    if rank == 3:
        wo_eval_8n = torch.ones(SCALE * 10 * 10, 4, dtype=torch.int64, device='cuda')
        print(wo_eval_8n.size())
    # clean up the process group
    dist.destroy_process_group()


if __name__ == "__main__":
    
    world_size = 4  # number of GPUs
    print(torch.__file__)
    pp, pk, cs = load("/home/whyin/data/9-data/")
    domain = Radix2EvaluationDomain.new(cs.circuit_bound())
    spawn(runTpoly, args=(world_size, pp, cs, pk.arithmetics_evals, pk.selectors_evals, domain), nprocs=world_size, join=True)
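(init_process is not shown above; a minimal sketch of that kind of setup helper, assuming the NCCL backend, the default env:// rendezvous, and hypothetical MASTER_ADDR/MASTER_PORT values, would look like this:)

    import os
    import torch
    import torch.distributed as dist

    def init_process(rank, size):
        # hypothetical single-node rendezvous settings
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        # pin each rank to its own GPU so NCCL does not put every rank on cuda:0
        torch.cuda.set_device(rank)
        dist.init_process_group(backend="nccl", rank=rank, world_size=size)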

I want to do point-to-point communication between rank 1 and rank 2, but the error below occurs. I have already verified that in my code every rank can communicate with rank 0. The four GPUs are in a fully connected topology, so there is no pair that cannot be physically connected. My PyTorch version is 2.0.


RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer
Exception raised from recvBytes at /home/whyin/pnp_new/PNP/torch/csrc/distributed/c10d/Utils.hpp:616 (most recent call first)

I also tried creating a communication group (new_group([1, 2])), but communication still fails that way. I want the two ranks to communicate directly with each other, without going through rank 0.
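A stripped-down version of the send/recv pattern I am trying to get working (a minimal sketch, assuming the NCCL backend, one GPU per rank, and hypothetical rendezvous values; run is a placeholder name) would be:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"   # hypothetical rendezvous values
        os.environ["MASTER_PORT"] = "29501"
        torch.cuda.set_device(rank)               # one GPU per rank
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

        tensor = torch.zeros(4, dtype=torch.int64, device="cuda")
        if rank == 1:
            tensor += 2
            dist.send(tensor, dst=2)              # point-to-point send to rank 2
        elif rank == 2:
            dist.recv(tensor, src=1)              # point-to-point receive from rank 1
            print(f"rank 2 received {tensor}")

        dist.barrier()                            # keep ranks 0 and 3 alive until the transfer is done
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(4,), nprocs=4, join=True)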

Upvotes: 0

Views: 57

Answers (1)

wynne yin

Reputation: 1

The problem is that the GPU environment needs to be cleaned up every time the program is run. The code is:

def clear_nccl_environment():
    dist.barrier()            # synchronize all ranks before cleaning up
    torch.cuda.empty_cache()  # release cached GPU memory
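A possible placement (my assumption, not shown in the original code) is right after the process group is set up and before any point-to-point calls, e.g. inside runTpoly:

    init_process(rank, size)
    clear_nccl_environment()   # barrier + empty_cache before any send/recv
    group2 = dist.new_group([1, 2])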

Upvotes: 0
