Reputation: 255
I have a following array, that consists of, say, 16 element and is in fact assembled from many small arrays:
[1,1,1,1|2,2,2,2,2,2|3,3,3,3,3,3|4,4,4,4]
In reality, an array is quite long, about 512 or 1024, total array length is is lesser than the maximum block size, so lesser than 1024. Array resides in shared memory because it is a result of previous computations. Every subarray, except for first and last is of the same size and all subarrays have an even number of elements.
In one CUDA block, I want to sum this array so that the result is
[4,...|12,...|18,...|16,...]
if the subarrays were of the length of the power of two there would be no problem, but that is rarely the fact, so one option would be to fill the array with 0s in such a manner that the subarrays would have the length of the power of two:
[1,1,1,1|2,2,2,2,2,2,0,0|3,3,3,3,3,3,0,0|4,4,4,4]
But this is a waste of tremendous amount of processing power and shared mem if I had subarrays of length 34 and I would add to each 30 0 valued elements to fill up to 64.
Does anyone see any efficient solution to sum such array?
Upvotes: 2
Views: 248
Reputation: 131546
Assuming the overall length of the block is fixed (either at runtime but before launch, or at compile time), why not do the following (for each thread)? :
There a few more details like how this changes in the last block but that should do it pretty well I think.
Upvotes: 1