Reputation: 57301
I am using my GPU concurrently with my CPU. When I profile memory transfers I find that the async calls in cuBLAS do not behave asynchronously.
I have code that does something like the following:
cudaEvent_t event;
cudaEventCreate(&event);
// time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
cudaEventRecord(event);
// time-point B
cudaEventSynchronize(event);
// time-point C
I'm using sys/time.h to profile (code omitted for clarity). I find that the cublasSetVectorAsync call dominates the time as though it were behaving synchronously, i.e. the duration A-B is much longer than the duration B-C, and it grows as I increase the size of the transfer.
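For reference, a rough sketch of the timing harness (not my exact code, but equivalent in spirit; the seconds() helper is just a gettimeofday wrapper):
#include <sys/time.h>

// Wall-clock time in seconds via gettimeofday
static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

cudaEvent_t event;
cudaEventCreate(&event);

double tA = seconds();                                   // time-point A
cublasSetVectorAsync(n, elemSize, x, incx, y, incy, 0);
cudaEventRecord(event);
double tB = seconds();                                   // time-point B
cudaEventSynchronize(event);
double tC = seconds();                                   // time-point C

printf("A-B: %.6f s, B-C: %.6f s\n", tB - tA, tC - tB);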
What are possible reasons for this? Is there some environment variable I need to set somewhere or an updated driver that I need to use?
I'm using a GeForce GTX 285 with Cuda compilation tools, release 4.1, V0.2.1221
Upvotes: 2
Views: 1152
Reputation: 27879
cublasSetVectorAsync is a thin wrapper around cudaMemcpyAsync. Unfortunately, in some circumstances the name of this function is a misnomer, as explained on this page from the CUDA reference manual.
Notably:
For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
And
For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed.
So the solution to your problem is likely to be just allocating x, your host data array, using cudaHostAlloc rather than standard malloc (or C++ new).
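For example, something along these lines (a minimal sketch; the names x, y, n and the float element type are assumptions, and error checking is omitted):
float *x = NULL;   // host source vector, pinned
float *y = NULL;   // device destination vector
cudaHostAlloc((void **)&x, n * sizeof(float), cudaHostAllocDefault);
cudaMalloc((void **)&y, n * sizeof(float));

// ... fill x on the host ...

// With x pinned, this call should return without waiting for the copy to finish
cublasSetVectorAsync(n, sizeof(float), x, 1, y, 1, 0);

// ... later, once the transfer has completed ...
cudaFreeHost(x);   // pinned host memory must be freed with cudaFreeHost
cudaFree(y);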
Alternatively, if your GPU and CUDA version support it, you can use malloc and then call cudaHostRegister on the malloc-ed pointer. Note the condition in the documentation that you must create your CUDA context with the cudaDeviceMapHost flag for cudaHostRegister to have any effect (see the documentation for cudaSetDeviceFlags).
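A sketch of that alternative (again with assumed names and sizes, error checking omitted):
cudaSetDeviceFlags(cudaDeviceMapHost);   // must happen before the context is created,
                                         // i.e. before the first call that touches the device

float *x = (float *)malloc(n * sizeof(float));
cudaHostRegister(x, n * sizeof(float), cudaHostRegisterMapped);

// ... use x with cublasSetVectorAsync as above ...

cudaHostUnregister(x);
free(x);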
Upvotes: 3
Reputation: 34
In cuBLAS/cuSPARSE, operations run in stream 0 (the default stream) unless you specify a different one. To specify a stream, you have to use cublasSetStream (see the cuBLAS documentation).
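For example, a minimal sketch with the cuBLAS v2 API (cublas_v2.h); the handle and stream names are placeholders:
cublasHandle_t handle;
cudaStream_t stream;

cublasCreate(&handle);
cudaStreamCreate(&stream);
cublasSetStream(handle, stream);   // subsequent cuBLAS routines on this handle use `stream`

// cublasSetVectorAsync takes the stream explicitly rather than through the handle:
cublasSetVectorAsync(n, elemSize, x, 1, y, 1, stream);

cudaStreamDestroy(stream);
cublasDestroy(handle);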
Upvotes: -1