quirell
quirell

Reputation: 255

CUDA sum many small sized arrays

I have a following array, that consists of, say, 16 element and is in fact assembled from many small arrays:

[1,1,1,1|2,2,2,2,2,2|3,3,3,3,3,3|4,4,4,4]

In reality, an array is quite long, about 512 or 1024, total array length is is lesser than the maximum block size, so lesser than 1024. Array resides in shared memory because it is a result of previous computations. Every subarray, except for first and last is of the same size and all subarrays have an even number of elements.

In one CUDA block, I want to sum this array so that the result is

[4,...|12,...|18,...|16,...]

if the subarrays were of the length of the power of two there would be no problem, but that is rarely the fact, so one option would be to fill the array with 0s in such a manner that the subarrays would have the length of the power of two:

[1,1,1,1|2,2,2,2,2,2,0,0|3,3,3,3,3,3,0,0|4,4,4,4]

But this is a waste of tremendous amount of processing power and shared mem if I had subarrays of length 34 and I would add to each 30 0 valued elements to fill up to 64.

Does anyone see any efficient solution to sum such array?

Upvotes: 2

Views: 248

Answers (1)

einpoklum
einpoklum

Reputation: 131546

Assuming the overall length of the block is fixed (either at runtime but before launch, or at compile time), why not do the following (for each thread)? :

  1. Determine whether your element is the last in a sequence (by reading it and the next one)
  2. Use balloting to determine which threads in the warp have a transition
  3. Share the warps' ballot results with the whole block (only one lane per warp writes that to appropriate place in shared memory)
  4. "Search" the bitmap of being last-in-segment for the entire block, backwards from your position, to find the previous transition.
  5. Now you know the number of elements in your segment; multiply that by your element's value and write in to the result.

There a few more details like how this changes in the last block but that should do it pretty well I think.

Upvotes: 1

Related Questions