Daniel K

Reputation: 33

Avoiding CudaMemcpy in an iterative loop

I am wondering what is the best way to do the following in CUDA: imagine* you have a long array and want the sum of all elements to be below 1. If the sum is above 1, you divide every element by 2 and calculate the sum again. Dividing by two and calculating the sum are both done on the GPU.

My question is: what is the best way to check on the CPU side whether the sum is below 1 or not? I could do a cudaMemcpy within every iteration, but I have also read (and seen) that it is better to do as few transfers between the two memories as possible. I found dynamic parallelism and thought maybe I could start a kernel with one block and one thread that does the while loop and calls the sum and divide kernels, but unfortunately my hardware only has compute capability 3.2, and dynamic parallelism only starts with 3.5.

So is there any other way, besides doing a cudaMemcpy every iteration, to tell the CPU that it can stop the while loop?

*The algorithm above is only a toy problem to explain the situation (hopefully). The actual algorithm is a Newton-Raphson method, but my question remains valid for any iterative method where I have to decide whether to stop, given a value that was calculated on the GPU.

Upvotes: 3

Views: 1107

Answers (1)

Jez

Reputation: 1781

For compute capability >= 3.5 the answer, as you correctly identify, would probably be dynamic parallelism.

For compute capability < 3.5 things are less clear. There are two options: the first is to look at the latency cost of the memcpy and kernel launch. The second is to use more advanced techniques to get finer control over your blocks.

Optimising for latency

If using memcpy, make sure you don't synchronize before launching the memcpy. If you don't synchronize, then much of the overhead associated with the copy can be hidden by the kernel.
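As a minimal sketch of that per-iteration copy, assuming placeholder names (sumKernel, divideKernel, d_data, d_sum, n, grid and block are illustrative, not from the question), the copy can simply be queued behind the kernel in the same stream so the host only waits once per iteration:

float *h_sum;                                    // pinned host memory so the async copy can overlap
cudaMallocHost((void **)&h_sum, sizeof(float));
do {
    sumKernel<<<grid, block>>>(d_data, d_sum, n);
    // No cudaDeviceSynchronize() before the copy: it is queued behind the
    // kernel in the same (default) stream and starts as soon as the kernel ends.
    cudaMemcpyAsync(h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();                     // one wait per iteration, for kernel + copy
    if (*h_sum >= 1.0f)
        divideKernel<<<grid, block>>>(d_data, n);
} while (*h_sum >= 1.0f);
cudaFreeHost(h_sum);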

That said, the lowest latency path for this case is probably found using mapped memory: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#mapped-memory. By using mapped memory the kernel will write directly to host memory without you having to explicitly launch a cudaMemcpy.
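A short sketch of the mapped-memory variant, assuming the reduction kernel writes its single result through the mapped device pointer (reduceKernel, d_data, d_sum, h_sum, n, grid and block are again illustrative names):

float *h_sum, *d_sum;
cudaSetDeviceFlags(cudaDeviceMapHost);                    // must run before the CUDA context is created
cudaHostAlloc((void **)&h_sum, sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_sum, h_sum, 0);

reduceKernel<<<grid, block>>>(d_data, d_sum, n);          // kernel writes its result to *d_sum
cudaDeviceSynchronize();                                  // after this, *h_sum holds the value
if (*h_sum < 1.0f) { /* stop iterating */ }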

Block control

For this problem we don't actually need global synchronisation, so by being clever we can avoid some trips to the host. In this case I would consider over-subscribing the GPU. If you know you need x blocks to complete an iteration of your problem, consider launching, for example, 5x blocks. Because the order in which blocks are launched is undefined, you then need to create an ordering using atomics (atomically incrementing a global integer once per block), as sketched below.
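For example, a hypothetical ticketing kernel might look like this (blockTicket, blocksPerStep and iterationKernel are made-up names, not from the answer):

__device__ unsigned int blockTicket = 0;           // global counter shared by all blocks

__global__ void iterationKernel(unsigned int blocksPerStep /*, ... */)
{
    __shared__ unsigned int myBlock;
    if (threadIdx.x == 0)
        myBlock = atomicAdd(&blockTicket, 1u);     // one increment per block
    __syncthreads();

    unsigned int step = myBlock / blocksPerStep;   // which iteration this block serves
    // Blocks with step == 0 work immediately; later steps spin on a flag (below).
}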

With this block ordering you now know which blocks are going to take part in the first step of the iteration. Any blocks not taking part in the first iteration can wait by spinning on a flag:

volatile int *vflag = flag; // read through a volatile pointer so the load is re-issued
int locked;
do {
  locked = *vflag;          // value is re-read from memory on every pass
} while (locked);

Once the first batch of blocks completes its operation, and the output is written to global memory, you can set the flag (make sure you use threadfence correctly!) allowing the blocks for the next step to start. These blocks can then either do the next step, or return immediately (after allowing the blocks depending on them to continue) if your condition is already met.
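As a sketch of that release step, assuming a single designated thread updates the flag and that 0 is the "unlocked" value the spin loop above is waiting for:

__device__ void releaseNextStep(volatile int *flag)
{
    // ...results for this step have already been written to global memory...
    __threadfence();   // make those writes visible device-wide first
    *flag = 0;         // only then release the blocks spinning on the flag
}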

The net result of this is that we already have blocks ready on the GPU waiting to start. By managing our block ordering we know that each iteration will always have enough blocks to complete, so the spinning blocks will always be released. The three things you need to make sure are correct are:

  1. You manage your own block IDs using atomics.
  2. You load the flag using the volatile keyword to ensure the correct value is read.
  3. You apply a threadfence to ensure the output is visible before allowing dependent blocks to continue.

Obviously you are unlikely to launch exactly the right number of blocks, so you will have to go back to the host to launch more blocks from time to time. The overhead of launching too many blocks shouldn't be too bad, but it does carry some risk of wasted work.

Before you implement this, do make sure that the latency cost of your copies is actually resulting in a significant slowdown. The overhead of copying to the host and conditionally launching another kernel should be of the order of 20 microseconds per iteration. This method will add quite a lot of complexity to your code, so be sure you really need to save those microseconds!

Upvotes: 3
