edhu

Reputation: 451

send/recv block in CuPy

I want to send a CuPy array from one node to another.

The sender has the following code:

import cupy
import cupyx.distributed
import torch.multiprocessing as mp

def send():
    cupy.cuda.Device(0).use()
    # 2 processes in total; this process is rank 1
    comm = cupyx.distributed.init_process_group(2, 1, host='192.168.0.5')
    send_buffer = cupy.ones((16,))
    comm.send(send_buffer, 0)  # send to rank 0
    comm.barrier()
    comm.stop()

if __name__ == '__main__':
    p0 = mp.Process(target=send, args=())
    p0.start()
    p0.join()

The receiver has the following code:

import cupy
import cupyx.distributed
import torch.multiprocessing as mp

def recv():
    cupy.cuda.Device(1).use()
    # 2 processes in total; this process is rank 0
    comm = cupyx.distributed.init_process_group(2, 0, host='192.168.0.5')
    recv_buffer = cupy.zeros((16,))
    comm.recv(recv_buffer, 1)  # receive from rank 1
    comm.barrier()
    print("After barrier")
    print(recv_buffer)
    comm.stop()

if __name__ == '__main__':
    p1 = mp.Process(target=recv, args=())
    p1.start()
    p1.join()

When I launch both processes on the same node (192.168.0.5), the receiver prints the buffer as expected. However, when I launch the sender on another node (192.168.0.6), the receiver prints only "After barrier"; the subsequent print(recv_buffer) never returns. Both nodes can ping each other without problems. Where should I start diagnosing the cause of the hang on the receiver side? Thanks!
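One possible starting point, sketched here under the assumption that the hang is in the NCCL transport between the nodes rather than in CuPy itself: print(recv_buffer) copies the array to the host and so waits for pending GPU work, which would explain a hang at that line if the receive was enqueued but never completed. NCCL's own logging may show which network interface each side picked. NCCL_DEBUG, NCCL_DEBUG_SUBSYS and NCCL_SOCKET_IFNAME are real NCCL environment variables; the helper names and the interface name "eth0" are hypothetical.

```python
import os


def nccl_debug_env(ifname=None):
    """Return NCCL settings that make transport problems visible in the log."""
    env = {
        "NCCL_DEBUG": "INFO",             # log NCCL initialization and errors
        "NCCL_DEBUG_SUBSYS": "INIT,NET",  # restrict the log to init and network
    }
    if ifname is not None:
        # Assumption: the nodes reach each other over this interface.
        env["NCCL_SOCKET_IFNAME"] = ifname
    return env


def apply_nccl_debug_env(ifname=None):
    """Export the settings into the current process and return them."""
    env = nccl_debug_env(ifname)
    os.environ.update(env)
    return env
```

The idea would be to call apply_nccl_debug_env("eth0") as the first line of both send() and recv(), before init_process_group, replacing "eth0" with the interface that actually carries the 192.168.0.x addresses. Adding cupy.cuda.Device().synchronize() right after comm.recv(...) should then show whether the hang is in the receive itself rather than in the print.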

Upvotes: 0

Views: 26

Answers (0)