I want to send a CuPy array from one node to another.
The sender has the following code:
import cupy
import cupyx.distributed
import torch.multiprocessing as mp

def send():
    cupy.cuda.Device(0).use()
    comm = cupyx.distributed.init_process_group(2, 1, host='192.168.0.5')
    send_buffer = cupy.ones((16))
    comm.send(send_buffer, 0)
    comm.barrier()
    comm.stop()

if __name__ == '__main__':
    p0 = mp.Process(target=send, args=())
    p0.start()
    p0.join()
The receiver has the following code:
import cupy
import cupyx.distributed
import torch.multiprocessing as mp

def recv():
    cupy.cuda.Device(1).use()
    comm = cupyx.distributed.init_process_group(2, 0, host='192.168.0.5')
    recv_buffer = cupy.zeros((16))
    comm.recv(recv_buffer, 1)
    comm.barrier()
    print("After barrier")
    print(recv_buffer)
    comm.stop()

if __name__ == '__main__':
    p1 = mp.Process(target=recv, args=())
    p1.start()
    p1.join()
When I launch both processes on the same node (192.168.0.5), the receiver prints the buffer as expected. However, when I launch the sender on another node (192.168.0.6), only "After barrier" is printed: the call to print(recv_buffer) never returns and the process hangs. I can ping both nodes without problems.

Where should I start diagnosing the cause of the hang on the receiver side? Thanks!
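One possible first check, as a minimal sketch: ping only exercises ICMP, so a successful ping does not show that TCP connections can be opened between the two nodes. The helper below tries to open a plain TCP connection to a given host and port. The name `tcp_reachable` and its parameters are hypothetical and not part of CuPy; which port to test is an assumption that depends on the `port` passed to `init_process_group` (or the library default when none is passed).

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection performs a full TCP handshake, unlike ping (ICMP).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `tcp_reachable('192.168.0.5', port)` from 192.168.0.6 while the receiver is waiting would indicate whether a firewall blocks that port. It may also help to set the NCCL environment variable `NCCL_DEBUG=INFO` before launching both processes, so that NCCL logs which network interface it selects on each node.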