Reputation: 1
import torch
import torch.distributed as dist
from torch.distributed import send, recv
from torch.multiprocessing import spawn

def runTpoly(rank, size, pp, cs, pkArithmetics_evals,
             pkSelectors_evals, domain):
    init_process(rank, size)
    group2 = torch.distributed.new_group([1, 2])
    if rank == 0:
        device = torch.device(f"cuda:{rank}")
        wo_eval_8n = torch.ones(SCALE * 8 * 1, 4, dtype=torch.int64, device='cuda')
    if rank == 1:
        wo_eval_8n = torch.ones(SCALE * 8 * 10, 4, dtype=torch.int64, device='cuda')
        wo_eval_8n = wo_eval_8n + wo_eval_8n
        send(wo_eval_8n, 2)
    if rank == 2:
        wo_eval_8n = torch.ones(SCALE * 8 * 10, 4, dtype=torch.int64, device='cuda')
        print(wo_eval_8n.size())
        recv(wo_eval_8n, 1)
        print(wo_eval_8n)
    if rank == 3:
        wo_eval_8n = torch.ones(SCALE * 10 * 10, 4, dtype=torch.int64, device='cuda')
        print(wo_eval_8n.size())
    # clean up the process group
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # number of GPUs
    print(torch.__file__)
    pp, pk, cs = load("/home/whyin/data/9-data/")
    domain = Radix2EvaluationDomain.new(cs.circuit_bound())
    spawn(runTpoly, args=(world_size, pp, cs, pk.arithmetics_evals, pk.selectors_evals, domain),
          nprocs=world_size, join=True)
I want to perform point-to-point communication between rank 1 and rank 2, but the error below occurs. I have already verified that in my code every rank can communicate with rank 0. The four GPUs are fully connected, so there is no case where two of them cannot be physically linked. My PyTorch version is 2.0.
RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer
Exception raised from recvBytes at /home/whyin/pnp_new/PNP/torch/csrc/distributed/c10d/Utils.hpp:616 (most recent call first)
I also tried creating a communication group (group2 above), but communication between the two ranks still fails. I want to achieve direct communication between two ranks without going through rank 0.
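For comparison, here is a minimal, self-contained sketch (not the original code: init_process, SCALE, and the loaded data are replaced with a plain init_process_group call and a small tensor, and the MASTER_ADDR/MASTER_PORT values are placeholders) of NCCL point-to-point send/recv between rank 1 and rank 2 in a 4-process job, where every rank stays alive until a final barrier so no process tears down the c10d store while the pairwise communicator is still being set up:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # Assumed rendezvous settings; adjust to your own setup.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)  # pin each process to its own GPU

        tensor = torch.ones(4, 4, dtype=torch.int64, device=f"cuda:{rank}")
        if rank == 1:
            payload = tensor + tensor
            dist.send(payload, dst=2)   # rank 1 -> rank 2, no detour through rank 0
        elif rank == 2:
            dist.recv(tensor, src=1)
            print(f"rank 2 received:\n{tensor}")

        # Every rank waits here, so no process exits (and drops the store)
        # while ranks 1 and 2 are still creating their NCCL communicator.
        dist.barrier()
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(4,), nprocs=4, join=True)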
Upvotes: 0
Views: 57
Reputation: 1
The problem is that every time we run the program, we need to clean up the GPU environment. The code is:
def clear_nccl_environment():
    dist.barrier()              # synchronize all ranks
    torch.cuda.empty_cache()    # release cached GPU memory
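As a usage sketch (assuming dist and torch are already imported and the process group has been initialized), the helper would be called by every rank before tearing the group down:

    # Hypothetical placement: call on every rank so all processes
    # synchronize and free cached CUDA memory before shutdown.
    clear_nccl_environment()
    dist.destroy_process_group()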
Upvotes: 0