Reputation: 3
I've got a CUDA kernel that is called many times (1 million is not the limit). Whether we launch the kernel again depends on a flag (result_found) that the kernel returns.
for (int i = 0; i < 1000000 /* for example */; ++i) {
    kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
    cudaMemcpy(&result_found, dev_result_found, sizeof(bool), cudaMemcpyDeviceToHost);
    if (result_found) {
        break;
    }
}
The profiler says that cudaMemcpy takes much more time to execute than the actual kernel call (cudaMemcpy: ~88us, cudaLaunch: ~17us).
So, the questions are:
1) Is there any way to avoid calling cudaMemcpy here?
2) Why is it so slow after all? Passing parameters to the kernel (cudaSetupArgument) seems very fast (~0.8us), while getting the result back is slow. If I remove the cudaMemcpy, my program finishes a lot faster, so I don't think it's a synchronization issue.
Upvotes: 0
Views: 712
Reputation: 1791
1) Is there any way to avoid calling cudaMemcpy here?
Yes. This is a case where dynamic parallelism may help. If your device supports it, you can move the entire loop over i onto the GPU and launch further kernels from the GPU. The launching thread can then directly read dev_result_found and return when the search has finished. This completely removes the cudaMemcpy.
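A minimal sketch of that pattern, assuming a device of compute capability 3.5 or newer and compilation with -rdc=true; the names search_step, driver and max_iters are placeholders rather than anything from the question, and note that device-side cudaDeviceSynchronize() is deprecated in recent CUDA toolkits:

__global__ void search_step( /*...*/ bool* dev_result_found);  // the original kernel

__global__ void driver( /*...*/ bool* dev_result_found, int max_iters)
{
    // A single launching thread drives the whole loop on the device.
    for (int i = 0; i < max_iters; ++i) {
        // Same launch configuration the host loop used.
        search_step<<<blocks, threads>>>( /*...*/, dev_result_found);
        cudaDeviceSynchronize();       // wait for the child grid to finish
        if (*dev_result_found) {       // device memory is directly readable here
            return;                    // done: no copy back to the host
        }
    }
}

// Host side: one launch replaces the million-iteration host loop.
driver<<<1, 1>>>( /*...*/, dev_result_found, 1000000);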
An alternative would be to greatly reduce the number of cudaMemcpy calls. Have the kernel check dev_result_found at the start; if it is already true, return immediately. That way you only need to perform the memcpy every x iterations. While you will launch more kernels than you need to, these will be very cheap, as the excess launches return immediately.
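A minimal sketch of that batched-check pattern; the check interval of 1000 is illustrative, and the kernel body is assumed to set *dev_result_found on success:

__global__ void kernel( /*...*/ bool* dev_result_found)
{
    if (*dev_result_found) {
        return;    // excess launches after a hit exit almost immediately
    }
    // ... actual work; sets *dev_result_found = true on success ...
}

// Host loop: copy the flag back only once every 1000 iterations.
for (int i = 0; i < 1000000; ++i) {
    kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
    if (i % 1000 == 999) {
        cudaMemcpy(&result_found, dev_result_found, sizeof(bool),
                   cudaMemcpyDeviceToHost);
        if (result_found) {
            break;
        }
    }
}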
I suspect a combination of the two methods will give the best performance.
2) Why is it so slow after all?
Hard to say. I'd suggest your numbers may be a bit suspicious - I guess you're using the API trace from the profiler. This measures time as seen by the CPU, so if you launch an asynchronous call (kernel launch) followed by a synchronous call (cudaMemcpy), the cost of synchronisation will be measured with the memcpy.
Still, if your kernel is relatively quick-running, the overhead of the copy may be significant. You are also unable to hide any launch overheads, as you cannot schedule the next launch asynchronously.
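To separate the two costs, you could time the kernel on the GPU side with CUDA events; a minimal sketch, reusing the launch from the question:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
cudaEventRecord(stop);
cudaEventSynchronize(stop);    // block until the kernel has actually finished

float kernel_ms = 0.0f;
cudaEventElapsedTime(&kernel_ms, start, stop);   // GPU-side duration, in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);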
Upvotes: 1