Reputation: 13
Hello, I am new to CUDA programming and I have a problem.
I have a variable, let's call it foo, stored in the shared memory of each block, with a different value from one block to another. I want a single thread to sum these values across all blocks. I thought of writing foo to global memory and then computing the sum, but is there a function that can do this more quickly?
Thanks for your help.
Upvotes: 1
Views: 1148
Reputation: 131544
It would be faster to have one thread in each block perform an atomicAdd() operation, adding the per-block value to a single, grid-wide variable in global memory (zeroed before the kernel launch).
See the relevant section (on atomic functions) of the CUDA C Programming Guide.
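A minimal sketch of that approach, under some assumptions: foo is an int, the kernel name sum_foo_across_blocks, the host helper sum_blocks, and the placeholder value blockIdx.x + 1 are all made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block keeps its own `foo` in shared memory,
// then one thread per block adds it to a single grid-wide accumulator.
__global__ void sum_foo_across_blocks(int *grid_sum)
{
    __shared__ int foo;

    // Placeholder for however foo is really computed in your kernel.
    if (threadIdx.x == 0) {
        foo = blockIdx.x + 1;
    }
    // Needed in general so every thread in the block sees the final foo.
    __syncthreads();

    // Exactly one thread per block contributes the block's value.
    if (threadIdx.x == 0) {
        atomicAdd(grid_sum, foo);
    }
}

// Hypothetical host helper: zeroes the accumulator, launches the kernel,
// and copies the total back to the host.
int sum_blocks(int num_blocks, int threads_per_block)
{
    int *d_sum = nullptr;
    int h_sum = 0;

    cudaMalloc(&d_sum, sizeof(int));
    cudaMemset(d_sum, 0, sizeof(int));  // accumulator must start at zero

    sum_foo_across_blocks<<<num_blocks, threads_per_block>>>(d_sum);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}

int main()
{
    // With 4 blocks the placeholder values are 1, 2, 3, 4.
    printf("%d\n", sum_blocks(4, 64));  // prints 10
    return 0;
}
```

Note that the total is only complete once the whole kernel has finished, so it is read back on the host here rather than by a thread inside the same launch. If foo is a float, atomicAdd() still applies, but the order of the additions is not fixed, so the result can differ slightly between runs.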
For a deeper exploration of optimizing reductions (= summation), albeit not necessarily the one you want to perform, have a look at Mark Harris' presentation: Optimizing Parallel Reduction in CUDA.
Upvotes: 2