MRocklin

Reputation: 57301

Asynchronous CUDA transfer calls not behaving asynchronously

I am using my GPU concurrently with my CPU. When I profile memory transfers I find that the async calls in cuBLAS do not behave asynchronously.

I have code that does something like the following

cudaEvent_t event;
cudaEventCreate(&event);
// time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
cudaEventRecord(event);
// time-point B
cudaEventSynchronize(event);
// time-point C

I'm using sys/time.h to profile (code omitted for clarity). I find that the cublasSetVectorAsync call dominates the time, as though it were behaving synchronously. That is, the duration A-B is much longer than the duration B-C, and it increases as I increase the size of the transfer.
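Roughly, the timing looks like this (a simplified sketch; my actual gettimeofday wrapper and error checking are omitted, and x/y stand in for my real host and device buffers):

    #include <stdio.h>
    #include <sys/time.h>
    #include <cublas.h>

    /* wall-clock time in milliseconds */
    static double now_ms(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    /* x: malloc'ed host array; y: device array from cudaMalloc */
    cudaEvent_t event;
    cudaEventCreate(&event);

    double a = now_ms();                                   /* time-point A */
    cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
    cudaEventRecord(event, 0);
    double b = now_ms();                                   /* time-point B */
    cudaEventSynchronize(event);
    double c = now_ms();                                   /* time-point C */

    printf("A-B: %f ms  (near zero if the call is truly async)\n", b - a);
    printf("B-C: %f ms  (time spent waiting for the transfer)\n", c - b);

In my runs, A-B is the large number, not B-C.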

What are possible reasons for this? Is there some environment variable I need to set somewhere or an updated driver that I need to use?

I'm using a GeForce GTX 285 with CUDA compilation tools, release 4.1, V0.2.1221.

Upvotes: 2

Views: 1152

Answers (2)

harrism

Reputation: 27879

cublasSetVectorAsync is a thin wrapper around cudaMemcpyAsync. Unfortunately, in some circumstances, the name of this function is a misnomer, as explained on this page from the CUDA reference manual.

Notably:

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

And

For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed.

So the solution to your problem is likely to just allocate x, your host data array, using cudaHostAlloc, rather than standard malloc (or C++ new).

Alternatively, if your GPU and CUDA version support it, you can use malloc and then call cudaHostRegister on the malloc-ed pointer. Note the condition in the documentation that you must create your CUDA context with the cudaDeviceMapHost flag in order for cudaHostRegister to have any effect (see the documentation for cudaSetDeviceFlags).
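A sketch of both approaches, assuming a float buffer of n elements (error checking omitted; y is your device array):

    /* Option 1: allocate the host buffer as pinned memory up front. */
    float *x;
    cudaHostAlloc((void **)&x, n * sizeof(float), cudaHostAllocDefault);
    /* ... fill x ... */
    cublasSetVectorAsync(n, sizeof(float), x, 1, y, 1, 0);  /* now truly async */
    cudaFreeHost(x);

    /* Option 2: pin an existing malloc'ed buffer with cudaHostRegister.
       cudaSetDeviceFlags must run before the context is created, i.e.
       before any other CUDA runtime call. */
    cudaSetDeviceFlags(cudaDeviceMapHost);
    float *x2 = (float *)malloc(n * sizeof(float));
    cudaHostRegister(x2, n * sizeof(float), cudaHostRegisterMapped);
    /* ... fill x2 ... */
    cublasSetVectorAsync(n, sizeof(float), x2, 1, y, 1, 0);
    /* later, after the transfer has completed: */
    cudaHostUnregister(x2);
    free(x2);

With the buffer pinned, your A-B interval should drop to roughly the cost of queuing the copy, and B-C should grow to cover the actual DMA time.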

Upvotes: 3

jdemouth

Reputation: 34

In cuBLAS/cuSPARSE, things take place in stream 0 if you don't specify a different stream. To specify a stream, you have to use cublasSetStream (see cuBLAS documentation).
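A minimal sketch using the cuBLAS v2 API (which is what provides cublasSetStream; handle creation and error checking abbreviated):

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cublasSetStream(handle, stream);  /* subsequent cuBLAS kernels launched
                                         through this handle run in `stream` */

    /* The Async helper functions take the stream directly as their
       last argument instead of using the handle: */
    cublasSetVectorAsync(n, elemSize, x, 1, y, 1, stream);

Note, though, that picking a non-default stream will not by itself make the copy asynchronous if the host buffer is pageable.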

Upvotes: -1
