Reputation: 23
I would like to improve my data-transfer algorithm between MPI CPU processes and a single GPU.
With NUMPROCS processes, each MPI process holds a 1D array of Ntot/NUMPROCS float4 elements.
My algorithm is very simple:
1) The 1D arrays are gathered (MPI_Gather) into a big array (size Ntot) on the master node.
2) On the master node, the big array is sent to the GPU via cudaMemcpy, and the CUDA kernel is launched from the master node.
Is it possible to avoid the first step? That is, can each MPI process send its own array via cudaMemcpy so that the concatenation happens directly in GPU memory?
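For reference, here is a minimal sketch of my current two-step approach (the sizes Ntot and the local array contents are placeholders, and the kernel launch is only indicated):

    /* Sketch of the current approach: MPI_Gather on rank 0, then one
     * cudaMemcpy from the master rank to the GPU. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const size_t Ntot  = 1 << 20;             /* total number of float4 (placeholder) */
        const size_t chunk = Ntot / nprocs;       /* float4 elements per rank */
        float4 *local = (float4 *)malloc(chunk * sizeof(float4));  /* this rank's data */
        float4 *big   = NULL;
        if (rank == 0)
            big = (float4 *)malloc(Ntot * sizeof(float4));

        /* Step 1: gather every rank's chunk into the big host array on rank 0 */
        MPI_Gather(local, (int)(chunk * sizeof(float4)), MPI_BYTE,
                   big,   (int)(chunk * sizeof(float4)), MPI_BYTE,
                   0, MPI_COMM_WORLD);

        /* Step 2: rank 0 copies the big array to the GPU and launches the kernel */
        if (rank == 0) {
            float4 *d_big;
            cudaMalloc(&d_big, Ntot * sizeof(float4));
            cudaMemcpy(d_big, big, Ntot * sizeof(float4), cudaMemcpyHostToDevice);
            /* kernel<<<grid, block>>>(d_big, Ntot); */
            cudaFree(d_big);
            free(big);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }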
Upvotes: 1
Views: 1145
Reputation: 61
Since your MPI CPU processes are running on the same physical host as the GPU, you can avoid your first step.
You can use the asynchronous function cudaMemcpyAsync()
for your second step. The function takes a stream parameter, which lets the GPU overlap
computation with memory copies.
In each process, you can use cudaSetDevice(devicenumber)
to select which GPU that process uses.
For details, see the CUDA Manual.
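The answer does not spell out how several processes can write into the same device allocation; one way to do that (my assumption, not necessarily what is meant above) is CUDA IPC: rank 0 allocates the big device buffer, shares its handle over MPI, and every rank copies its own chunk to its offset with cudaMemcpyAsync. A sketch under that assumption (buffer names and sizes are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const size_t Ntot  = 1 << 20;            /* total number of float4 (placeholder) */
        const size_t chunk = Ntot / nprocs;      /* float4 elements per rank */
        float4 *local = (float4 *)malloc(chunk * sizeof(float4));  /* this rank's data */

        cudaSetDevice(0);                        /* all ranks use the same GPU */

        float4 *d_big = NULL;
        cudaIpcMemHandle_t handle;
        if (rank == 0) {
            /* rank 0 owns the concatenated device buffer */
            cudaMalloc(&d_big, Ntot * sizeof(float4));
            cudaIpcGetMemHandle(&handle, d_big);
        }
        /* every rank receives the IPC handle for rank 0's allocation */
        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank != 0)
            cudaIpcOpenMemHandle((void **)&d_big, handle,
                                 cudaIpcMemLazyEnablePeerAccess);

        /* each rank copies its chunk directly to its offset in GPU memory */
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d_big + rank * chunk, local,
                        chunk * sizeof(float4), cudaMemcpyHostToDevice, s);
        cudaStreamSynchronize(s);

        MPI_Barrier(MPI_COMM_WORLD);             /* all chunks are now on the GPU */
        if (rank == 0) {
            /* kernel<<<grid, block>>>(d_big, Ntot); */
        }

        if (rank != 0) cudaIpcCloseMemHandle(d_big);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) cudaFree(d_big);
        cudaStreamDestroy(s);
        free(local);
        MPI_Finalize();
        return 0;
    }

With pageable host memory the cudaMemcpyAsync calls behave like synchronous copies; pinning the local arrays with cudaHostRegister or cudaMallocHost would let the copies actually overlap.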
Upvotes: 1