Reputation: 23
I would like to improve my data-transfer algorithm between MPI CPU processes and a single GPU.
With NUMPROCS processes, each MPI process holds a 1D array of Ntot/NUMPROCS float4 elements.
My algorithm is very simple:
1) The 1D arrays are gathered (MPI_Gather) into a big array (size Ntot) on the master node.
2) On the master node, the big array is sent to the GPU via cudaMemcpy, and the CUDA kernel is launched from the master node.
Is it possible to avoid the first step? That is, can each MPI process send its own array via cudaMemcpy so that the concatenation happens directly in GPU memory?
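For reference, here is a minimal sketch of my current two-step approach (the sizes Ntot and the local array contents are placeholders, and the kernel launch is only indicated):

    /* Sketch of the current approach: MPI_Gather on rank 0, then one
     * cudaMemcpy from the master rank to the GPU. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const size_t Ntot  = 1 << 20;             /* total number of float4 (placeholder) */
        const size_t chunk = Ntot / nprocs;       /* float4 elements per rank */
        float4 *local = (float4 *)malloc(chunk * sizeof(float4));  /* this rank's data */
        float4 *big   = NULL;
        if (rank == 0)
            big = (float4 *)malloc(Ntot * sizeof(float4));

        /* Step 1: gather every rank's chunk into the big host array on rank 0 */
        MPI_Gather(local, (int)(chunk * sizeof(float4)), MPI_BYTE,
                   big,   (int)(chunk * sizeof(float4)), MPI_BYTE,
                   0, MPI_COMM_WORLD);

        /* Step 2: rank 0 copies the big array to the GPU and launches the kernel */
        if (rank == 0) {
            float4 *d_big;
            cudaMalloc(&d_big, Ntot * sizeof(float4));
            cudaMemcpy(d_big, big, Ntot * sizeof(float4), cudaMemcpyHostToDevice);
            /* kernel<<<grid, block>>>(d_big, Ntot); */
            cudaFree(d_big);
            free(big);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }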
Upvotes: 1
Views: 1145
Reputation: 61
Since your MPI CPU processes are running on the same physical host as the GPU, you can avoid your first step.
You can use the asynchronous function cudaMemcpyAsync()
for your second step. The function takes a stream parameter, which lets the GPU overlap
computation with memory copies.
In each process, you can use cudaSetDevice(devicenumber)
to select which GPU that process uses.
For details, see the CUDA Manual.
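The answer does not spell out how several processes can write into the same device allocation; one way to do that (my assumption, not necessarily what is meant above) is CUDA IPC: rank 0 allocates the big device buffer, shares its handle over MPI, and every rank copies its own chunk to its offset with cudaMemcpyAsync. A sketch under that assumption (buffer names and sizes are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const size_t Ntot  = 1 << 20;            /* total number of float4 (placeholder) */
        const size_t chunk = Ntot / nprocs;      /* float4 elements per rank */
        float4 *local = (float4 *)malloc(chunk * sizeof(float4));  /* this rank's data */

        cudaSetDevice(0);                        /* all ranks use the same GPU */

        float4 *d_big = NULL;
        cudaIpcMemHandle_t handle;
        if (rank == 0) {
            /* rank 0 owns the concatenated device buffer */
            cudaMalloc(&d_big, Ntot * sizeof(float4));
            cudaIpcGetMemHandle(&handle, d_big);
        }
        /* every rank receives the IPC handle for rank 0's allocation */
        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank != 0)
            cudaIpcOpenMemHandle((void **)&d_big, handle,
                                 cudaIpcMemLazyEnablePeerAccess);

        /* each rank copies its chunk directly to its offset in GPU memory */
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d_big + rank * chunk, local,
                        chunk * sizeof(float4), cudaMemcpyHostToDevice, s);
        cudaStreamSynchronize(s);

        MPI_Barrier(MPI_COMM_WORLD);             /* all chunks are now on the GPU */
        if (rank == 0) {
            /* kernel<<<grid, block>>>(d_big, Ntot); */
        }

        if (rank != 0) cudaIpcCloseMemHandle(d_big);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) cudaFree(d_big);
        cudaStreamDestroy(s);
        free(local);
        MPI_Finalize();
        return 0;
    }

With pageable host memory the cudaMemcpyAsync calls behave like synchronous copies; pinning the local arrays with cudaHostRegister or cudaMallocHost would let the copies actually overlap.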
Upvotes: 1