Reduce in CUDA for an arbitrary number of elements

Question

How can I implement version 7 of the reduction code given in http://www.cuvilib.com/Reduction.pdf for an input array of arbitrary size, that is, one whose length is not a power of 2?

Robert Crovella · Accepted Answer

Version 7 already handles an arbitrary number of elements.

Instead of the cuvilib link, you may want to look at the relevant NVIDIA CUDA reduction sample. It includes essentially the same PDF you are using, along with sample code that implements reductions 1 through 7 (labelled reduce0 through reduce6).

If you study the description of reduction 7 in the document, you'll see that the initial reduction steps are handled by a while loop that makes the grid stride through memory. As the grid moves through memory, each thread accumulates multiple input elements.

This initial while loop is not limited to a particular size of problem (e.g. power of 2).
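
To make the while loop described above concrete, here is a minimal sketch of the accumulation stage only. It is not the code from the sample: the names `accumulateStrided` and `sumOnDevice` are hypothetical, the element type is assumed to be `int`, and the per-thread results are combined with a plain `atomicAdd` purely to keep the example short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread walks the input with a stride equal to
// the total number of threads in the grid, so any n is covered.
__global__ void accumulateStrided(const int *in, int *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;  // threads in the whole grid

    int mySum = 0;
    while (i < n) {        // the bound is n itself, not a power of 2
        mySum += in[i];
        i += gridSize;
    }

    // Simplification for this sketch: combine per-thread sums atomically.
    atomicAdd(out, mySum);
}

// Hypothetical host helper. Error checking is omitted for brevity.
int sumOnDevice(const std::vector<int> &h_in, int blocks, int threads)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    accumulateStrided<<<blocks, threads>>>(d_in, d_out, n);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With an input of 1000003 ones (not a power of 2), `sumOnDevice(input, 64, 256)` should return 1000003, because the loop condition `i < n` is the only place the input size appears.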

Because this while loop handles the initial part of the reduction, the later steps can be done as an efficient power-of-2 reduction at the threadblock level, as discussed earlier in that document. The initial input size, however, is not limited to a power of 2.
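
As a rough illustration of how the two stages fit together, the sketch below follows the strided loop with a shared-memory tree reduction over a power-of-2 block size. It is a simplified assumption of the structure, not the reduce6 code itself: it omits the unrolling and template techniques the document describes, the names `blockReduce`, `reduceOnDevice` and `BLOCK` are hypothetical, and the per-block sums are added up on the host.

```cuda
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

// Threads per block. Must be a power of 2 for the halving loop below.
constexpr unsigned int BLOCK = 256;

// Hypothetical kernel: strided accumulation, then a power-of-2 tree
// reduction in shared memory. Launch with exactly BLOCK threads per block.
__global__ void blockReduce(const int *in, int *blockSums, unsigned int n)
{
    __shared__ int sdata[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * BLOCK + tid;
    unsigned int gridSize = BLOCK * gridDim.x;

    // Stage 1: works for any n. Threads that never enter the loop keep 0.
    int mySum = 0;
    while (i < n) {
        mySum += in[i];
        i += gridSize;
    }
    sdata[tid] = mySum;
    __syncthreads();

    // Stage 2: exactly BLOCK values per block, regardless of n.
    for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        blockSums[blockIdx.x] = sdata[0];
    }
}

// Hypothetical host helper. Error checking is omitted for brevity.
int reduceOnDevice(const std::vector<int> &h_in, unsigned int blocks)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    std::vector<int> h_sums(blocks);
    int *d_in = nullptr;
    int *d_sums = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_sums, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockReduce<<<blocks, BLOCK>>>(d_in, d_sums, n);

    cudaMemcpy(h_sums.data(), d_sums, blocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);

    // Final step on the host: add the per-block partial sums.
    return std::accumulate(h_sums.begin(), h_sums.end(), 0);
}
```

The point of the design is that the shared-memory stage always sees `BLOCK` values per block, so only the block size needs to be a power of 2. For example, `reduceOnDevice(std::vector<int>(1000003, 1), 64)` should return 1000003.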

Please study the code given in the CUDA sample (reduce6).
