CUDA: determine the last block on an SM

Question

In short: is it possible to determine whether a block is the last (or the first) one to run on a particular SM?

Details: I have a problem where each block performs a fairly complex calculation that results in an array of about 2K elements, and I want to sum these elements. I have about 3K blocks. If every block does an atomic add into a global memory array at the end, that could be very slow. So what I would like to do is:

1. Use a shared array to sum the values on each SM.
2. If the block is the first one on that SM (no other block has run on that particular SM yet), initialize the shared array (clear it to 0).
3. Do the calculation and add the result to the shared array.
4. If it is the last block on this SM, atomically add the shared array values to the global array.

Is this possible? If not, is there another solution?

Robert Crovella · Accepted Answer

It's not possible.

Shared memory is allocated per block. The lifetime of the shared memory begins when the block begins and ends when the block ends. Shared memory of other blocks on the SM will be separate, and it's not legal or valid to assume they would happen to be in the same place.

Each block should do its own reduction and write its values to global memory. If you want to avoid the atomics, then have each block write its own values to separate locations in global memory, and have the last block in the grid perform the final calculations. This is possible by following the method outlined in the threadFenceReduction CUDA sample code.
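As a rough illustration of the last-block approach, here is a minimal sketch, not the sample code itself. It assumes the goal is an element-wise sum of the per-block arrays, that each block gets its own row of a global `partials` buffer, and it uses a placeholder `computeElement` in place of the real calculation. The names `N`, `blocksDone`, `sumLastBlock` and `sumAcrossBlocks` are made up for this example, and error checking is omitted. The memory-ordering details should be checked against the threadFenceReduction sample and the programming guide before relying on them.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 2048;                  // elements produced per block (assumed size)

__device__ unsigned int blocksDone = 0;  // number of blocks that have published their row

// Placeholder for the real per-block calculation (hypothetical).
__device__ float computeElement(int block, int i) { return (float)(i & 3); }

__global__ void sumLastBlock(volatile float *partials, float *result)
{
    __shared__ bool isLast;

    // 1. Each block writes its N values to its own row in global memory.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        partials[(size_t)blockIdx.x * N + i] = computeElement(blockIdx.x, i);

    // 2. Make this thread's writes visible device-wide, then wait until
    //    every thread of the block has done the same.
    __threadfence();
    __syncthreads();

    // 3. Thread 0 takes a ticket; the block that draws the final ticket is
    //    the last one in the grid to finish.
    if (threadIdx.x == 0) {
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        isLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // 4. Only the last block sums all rows; no atomics on the data are needed.
    if (isLast) {
        for (int i = threadIdx.x; i < N; i += blockDim.x) {
            float sum = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; ++b)
                sum += partials[(size_t)b * N + i];
            result[i] = sum;
        }
        if (threadIdx.x == 0)
            blocksDone = 0;              // reset so the kernel can be launched again
    }
}

// Host wrapper: launches numBlocks blocks and returns the N summed values.
std::vector<float> sumAcrossBlocks(int numBlocks)
{
    assert(numBlocks > 0);
    float *partials = nullptr, *result = nullptr;
    cudaMalloc(&partials, (size_t)numBlocks * N * sizeof(float));
    cudaMalloc(&result, N * sizeof(float));

    sumLastBlock<<<numBlocks, 256>>>(partials, result);

    std::vector<float> host(N);
    cudaMemcpy(host.data(), result, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(partials);
    cudaFree(result);
    return host;
}
```

The trade-off is extra global memory (one row per block) and a serial final pass by a single block, in exchange for not doing any atomic adds on the data.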

You could also have each block loop over multiple data sets. In that case, each block will be able to accumulate the results from several data sets into shared memory, before writing the intermediate results to global memory.
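A minimal sketch of that idea follows, assuming again an element-wise sum and a placeholder `computeElement`. The names `accumulatePerBlock` and `sumDataSets` are invented for this example, and error checking is omitted. A small grid is launched, each block strides over several data sets, accumulates into its own shared array, and only then adds to the global array, so the number of atomic adds per element drops from one per data set to one per block.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 2048;                  // elements per data set (assumed size)

// Placeholder for the real per-data-set calculation (hypothetical).
__device__ float computeElement(int dataSet, int i) { return (float)(i & 3); }

__global__ void accumulatePerBlock(int numDataSets, float *result)
{
    __shared__ float acc[N];             // 8 KB, private to this block

    // Clear this block's accumulator.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        acc[i] = 0.0f;
    __syncthreads();

    // Block-stride loop: block b handles data sets b, b + gridDim.x, ...
    for (int d = blockIdx.x; d < numDataSets; d += gridDim.x)
        for (int i = threadIdx.x; i < N; i += blockDim.x)
            acc[i] += computeElement(d, i);
    __syncthreads();

    // One atomic add per element per block, instead of one per data set.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        atomicAdd(&result[i], acc[i]);
}

// Host wrapper: sums numDataSets data sets using numBlocks blocks.
std::vector<float> sumDataSets(int numDataSets, int numBlocks)
{
    assert(numDataSets > 0 && numBlocks > 0);
    float *result = nullptr;
    cudaMalloc(&result, N * sizeof(float));
    cudaMemset(result, 0, N * sizeof(float));   // the atomics accumulate into zeros

    accumulatePerBlock<<<numBlocks, 256>>>(numDataSets, result);

    std::vector<float> host(N);
    cudaMemcpy(host.data(), result, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(result);
    return host;
}
```

For example, `sumDataSets(3000, 64)` processes 3000 data sets with 64 blocks, so each element of the global array receives 64 atomic adds rather than 3000. A reasonable starting point for the block count is a small multiple of the number of SMs, though the best value depends on the kernel.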
