Kirill Kuvshinov
Kirill Kuvshinov

Reputation: 3

Eliminate cudaMemcpy between kernel calls

I've got a CUDA kernel that is called many times (1 million is not the limit). Whether we launch kernel again or not depends on flag (result_found), that our kernel returns.

for(int i = 0; i < 1000000 /* for example */; ++i) {
    kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
    cudaMemcpy(&result_found, dev_result_found, sizeof(bool), cudaMemcpyDeviceToHost);
    if(result_found) {
        break;
    }
}

The profiler says that cudaMemcpy takes much more time to execute, than actual kernel call (cudaMemcpy: ~88us, cudaLaunch: ~17us).

So, the questions are:

1) Is there any way to avoid calling cudaMemcpy here?

2) Why is it so slow after all? Passing parameters to the kernel (cudaSetupArgument) seems very fast (~0.8 us), while getting the result back is slow. If I remove cudaMemcpy, my program finishes a lot faster, so I think that it's not because of synchronization issues.

Upvotes: 0

Views: 712

Answers (1)

Jez
Jez

Reputation: 1791

1) Is there any way to avoid calling cudaMemcpy here?

Yes. This is a case where dynamic parallelism may help. If your device supports it you can move the entire loop over i on to the GPU and launch further kernels from the GPU. The launching thread can then directly read dev_result_found and return if it has finished. This completely removes cudaMemcpy.

An alternative would be to greatly reduce the number of cudaMemcpy calls. At the start of each kernel launch check against dev_result_found. If it is true, return. This way you only need to perform the memcpy every x iterations. While you will launch more kernels than you need to, these will be very cheap as the excess will return immediately.

I suspect a combination of the two methods will give best performance.

2) Why is it so slow after all?

Hard to say. I'd suggest your numbers may be a bit suspicious - I guess you're using the API trace from the profiler. This measures time as seen by the CPU, so if you launch an asynchronous call (kernel launch) followed by a a sychronous call (cudaMemcpy) the cost of synchronisaiton will be measured with the memcpy.

Still, if your kernel is relatively quick-running the overhead of the copy may be significant. You are also unable to hide any launch overheads, as you cannot schedule the next launch asynchronously.

Upvotes: 1

Related Questions