CUDA coalesced access of FP64 data

Question

I am a bit confused about how the memory accesses issued by a warp are affected by FP64 data.

A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?
I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?
So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Here is my question now:

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

PS: I am mostly interested in Compute Capability 2.0+ architectures

talonmies · Accepted Answer

A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?

Correct

I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?

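Not exactly. 128 bytes is not the only transaction size; there are also 32-byte transactions (on Compute Capability 2.x, for instance, loads that bypass L1 and are cached only in L2 are serviced in 32-byte segments).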

So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Correct

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

Yes. The compiler will emit a 64-bit load instruction for each thread, and when coalesced access is possible, the warp's 256-byte request will be serviced by two 128-byte transactions.
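
The arithmetic behind these counts can be sketched on the host. This is only a simplified model, not a description of what the hardware does internally: it assumes thread i of the warp accesses one element at `base + i * elem_size`, and that memory is serviced in aligned segments of a fixed size. The names `kWarpSize` and `segments_touched` are hypothetical and chosen purely for illustration.

```cpp
#include <cstddef>

// Simplified model (an assumption, not the exact hardware behaviour):
// thread i of a 32-thread warp accesses elem_size bytes at
// base + i * elem_size, and memory is serviced in aligned segments
// of segment_size bytes.
constexpr std::size_t kWarpSize = 32;

// Number of aligned segments covered by the warp's contiguous byte range
// [base, base + kWarpSize * elem_size).
constexpr std::size_t segments_touched(std::size_t base,
                                       std::size_t elem_size,
                                       std::size_t segment_size) {
    return (base + kWarpSize * elem_size - 1) / segment_size
           - base / segment_size + 1;
}

// 32 floats (4 bytes each) = 128 bytes: one 128-byte transaction.
static_assert(segments_touched(0, 4, 128) == 1, "FP32, aligned");

// 32 doubles (8 bytes each) = 256 bytes: two 128-byte transactions.
static_assert(segments_touched(0, 8, 128) == 2, "FP64, aligned");

// The same FP64 request expressed in 32-byte segments.
static_assert(segments_touched(0, 8, 32) == 8, "FP64, 32-byte segments");

// A base address that is not 128-byte aligned spills into a third segment.
static_assert(segments_touched(64, 8, 128) == 3, "FP64, misaligned base");
```

Under this model, the FP64 case costs twice as many transactions as the FP32 case simply because the warp requests twice as many bytes; every byte fetched is still used, so the access remains fully coalesced. The last case shows why alignment of the base address matters as well.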
