Yellow

Reputation: 3957

Improving asynchronous execution in CUDA

I am currently writing a programme that performs large simulations on the GPU using the CUDA API. To improve performance, I tried to run my kernels concurrently in several streams and then asynchronously copy the results back into host memory. The code looks roughly like this:

#define NSTREAMS   8
#define BLOCKDIMX  16
#define BLOCKDIMY  16

// streams[0..NSTREAMS-1] are created once during initialisation
// (see the setup sketch below)
extern cudaStream_t streams[NSTREAMS];

void domainUpdate(float* domain_cpu,       // pointer to domain on host
                  float* domain_gpu,       // pointer to domain on device
                  const unsigned int dimX,
                  const unsigned int dimY,
                  const unsigned int dimZ)
{
    dim3 blocks((dimX + BLOCKDIMX - 1) / BLOCKDIMX, (dimY + BLOCKDIMY - 1) / BLOCKDIMY);
    dim3 threads(BLOCKDIMX, BLOCKDIMY);

    // Split the domain into NSTREAMS slabs along Z; each stream updates its
    // slab and copies it back to the host as soon as its kernel has finished.
    for (unsigned int ii = 0; ii < NSTREAMS; ++ii) {

        updateDomain3D<<<blocks, threads, 0, streams[ii]>>>(domain_gpu,
                                                            dimX, 0, dimX - 1,  // dimX, minX, maxX
                                                            dimY, 0, dimY - 1,  // dimY, minY, maxY
                                                            dimZ, dimZ * ii / NSTREAMS, dimZ * (ii + 1) / NSTREAMS - 1); // dimZ, minZ, maxZ

        unsigned int offset = dimX * dimY * dimZ * ii / NSTREAMS;
        cudaMemcpyAsync(domain_cpu + offset,
                        domain_gpu + offset,
                        sizeof(float) * dimX * dimY * dimZ / NSTREAMS,
                        cudaMemcpyDeviceToHost, streams[ii]);
    }

    cudaDeviceSynchronize();
}
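
The stream and buffer setup is omitted above; a minimal sketch of what it is assumed to look like (domainInit is only an illustrative name, and the host-side domain is assumed to be allocated as pinned memory, which cudaMemcpyAsync needs in order to overlap with kernel execution):

cudaStream_t streams[NSTREAMS];

void domainInit(float** domain_cpu, float** domain_gpu,
                const unsigned int dimX,
                const unsigned int dimY,
                const unsigned int dimZ)
{
    const size_t bytes = sizeof(float) * dimX * dimY * dimZ;

    for (unsigned int ii = 0; ii < NSTREAMS; ++ii)
        cudaStreamCreate(&streams[ii]);

    // Pinned (page-locked) host memory: required for cudaMemcpyAsync to be
    // truly asynchronous and overlap with kernel execution.
    cudaMallocHost((void**)domain_cpu, bytes);
    cudaMalloc((void**)domain_gpu, bytes);
}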

All in all it is just a simple for-loop, iterating over all streams (8 in this case) and dividing the work between them. This is actually a good deal faster (up to a 30% performance gain), although perhaps less than I had hoped. I analysed a typical cycle in Nvidia's Compute Visual Profiler, and the execution looks like this:

CUDA API trace in the Compute Visual Profiler

As can be seen in the picture, the kernels do overlap, but never more than two of them run at the same time. I tried the same thing with different numbers of streams and different sizes of the simulation domain, but this is always the case.

So my question is: is there a way to encourage or force the GPU scheduler to run more than two kernels at the same time? Or is this a limitation of the GPU hardware that cannot be worked around in code?

My system specifications are: 64-bit Windows 7 and a GeForce GTX 670 graphics card (Kepler architecture, compute capability 3.0).

Upvotes: 4

Views: 1242

Answers (1)

tera

Reputation: 7245

Kernels overlap only if the GPU has resources left to run a second kernel. Once the GPU is fully loaded, there is no gain from running more kernels in parallel, so the driver does not do that.
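
As a self-contained sketch (not your code) of what "resources left" means in practice: if each kernel launches only a single block, the GPU has plenty of room for the others, and all eight launches below can overlap in the profiler. Your launches, by contrast, use a full 2D grid per kernel, which is typically enough to fill the device on its own, so at most the tail of one kernel overlaps with the start of the next.

#include <cuda_runtime.h>

// Deliberately long-running kernel that uses only ONE block, so it leaves
// most of the GPU free for kernels launched in other streams.
__global__ void busyKernel(float* data, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        for (int j = 0; j < 1000; ++j)
            data[i] = data[i] * 0.999f + 0.001f;
}

int main()
{
    const int NSTREAMS = 8;
    const int N = 1 << 16;

    cudaStream_t streams[NSTREAMS];
    float*       buffers[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void**)&buffers[i], N * sizeof(float));
    }

    // One block per launch: each kernel leaves resources for the others,
    // so several of them can genuinely run at the same time.
    for (int i = 0; i < NSTREAMS; ++i)
        busyKernel<<<1, 256, 0, streams[i]>>>(buffers[i], N);

    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}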

Upvotes: 2
