For which sizes are plain loads and stores to global memory in CUDA atomic?

Question

Are general reads and writes to global memory atomic in CUDA if:

It is a 4-byte instruction? (I assume yes)
It is an 8-byte or 16-byte instruction? (I assume yes)
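
For concreteness, the three access widths in question could look like the sketch below. The kernel name `copy_plain` and the helper `run_copy` are hypothetical, and it is an assumption (not a guarantee) that the compiler emits a single 4-, 8-, or 16-byte load/store instruction for each naturally aligned `int`, `int2`, and `int4` access. The sketch only shows the access pattern; it does not demonstrate atomicity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies one element using a plain (non-atomic) load and store.
// Assumption: for naturally aligned T this becomes one load and one store
// instruction of sizeof(T) bytes; inspect the generated PTX/SASS to confirm.
template <typename T>
__global__ void copy_plain(const T* in, T* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Hypothetical helper: allocate two buffers of n elements, run the copy,
// and return the first CUDA error encountered (cudaSuccess if none).
template <typename T>
cudaError_t run_copy(int n) {
    T* in = nullptr;
    T* out = nullptr;
    cudaError_t err = cudaMalloc(&in, n * sizeof(T));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&out, n * sizeof(T));
    if (err != cudaSuccess) { cudaFree(in); return err; }

    cudaMemset(in, 0, n * sizeof(T));
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    copy_plain<T><<<blocks, threads>>>(in, out, n);
    err = cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return err;
}

int main() {
    const int n = 256;
    // int: 4-byte access, int2: 8-byte access, int4: 16-byte access.
    printf("int  (%zu bytes): %s\n", sizeof(int),  cudaGetErrorString(run_copy<int>(n)));
    printf("int2 (%zu bytes): %s\n", sizeof(int2), cudaGetErrorString(run_copy<int2>(n)));
    printf("int4 (%zu bytes): %s\n", sizeof(int4), cudaGetErrorString(run_copy<int4>(n)));
    return 0;
}
```

The built-in vector types `int2` and `int4` are used because they carry 8-byte and 16-byte alignment requirements, which is what allows the wide accesses in the first place.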

On Kepler and Fermi at least, are general 4-byte reads and writes to global memory atomic at warp level, or 8/16-byte instructions atomic at half/quarter-warp level, if:

All warp threads access the same 32-byte L2 transaction block? (I assume yes)
Warp threads access different 32-byte L2 transaction blocks, but all warp threads access the same 128-byte L2 cache line? (I assume no)
All warp threads access different L2 cache lines? (I assume no)

If any of those assumptions about atomicity at warp level is correct, is there any way to exploit this knowledge without risking compatibility with future Compute Capabilities?


Answers (1)
