Alex

Reputation: 13116

Load the same 32 bytes (ulong4) from shared memory in every thread of a warp

If every thread of a warp accesses shared memory at the same address, how will a 32-byte (ulong4) load behave? Will it be broadcast? Will the access take the same time as if each thread loaded a 2-byte unsigned short int?

Now, if I need every thread of a warp to load the same 32 or 64 bytes from shared memory, how could I do this?

Upvotes: 1

Views: 965

Answers (1)

tera

Reputation: 7255

On devices before compute capability 3.0, shared memory accesses are always 32 bits / 4 bytes wide and will be broadcast if all threads of a warp access the same address. Wider accesses compile to multiple instructions.

On compute capability 3.0, shared memory accesses can be configured to be either 32 bits or 64 bits wide using cudaDeviceSetSharedMemConfig(). The chosen setting applies to the entire kernel, though.
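
For illustration, here is a minimal sketch of both points; the kernel name, launch configuration, and output buffer are hypothetical:

    #include <cuda_runtime.h>

    // Hypothetical kernel: every thread of the block reads the same
    // 32-byte ulong4 from shared memory. Because all threads of a warp
    // access the same address, each underlying shared memory load is
    // broadcast with no bank conflicts. On devices before compute
    // capability 3.0, the 32-byte load compiles to multiple 4-byte
    // shared memory instructions.
    __global__ void broadcastFromShared(ulong4 *out)
    {
        __shared__ ulong4 value;   // one 32-byte value for the whole block
        if (threadIdx.x == 0)
            value = make_ulong4(1, 2, 3, 4);
        __syncthreads();

        // Same shared memory address for all threads -> broadcast.
        out[blockIdx.x * blockDim.x + threadIdx.x] = value;
    }

    int main()
    {
        // On compute capability 3.x, request 64-bit wide shared memory
        // banks so the ulong4 load needs fewer instructions.
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

        ulong4 *d_out;
        cudaMalloc(&d_out, 128 * sizeof(ulong4));
        broadcastFromShared<<<1, 128>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }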


[As I had originally missed the little word "shared" in the question, I gave a completely off-topic answer for global memory instead. Since that one should still be correct, I'll leave it in here:]

It depends (see the sketch after this list):

  • Compute capability 1.0 and 1.1 don't broadcast and use 64 separate 32-byte memory transactions (two 16-byte loads, each extended to the minimum 32-byte transaction size, for each thread of the warp)
  • Compute capability 1.2 and 1.3 broadcast, so two 32-byte transactions (two 16-byte loads, each extended to the minimum 32-byte transaction size) suffice for all threads of the warp
  • Compute capability 2.0 and higher just read a 128-byte cache line and satisfy all requests from there.
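
For concreteness, a minimal sketch of the global memory access pattern being discussed (the kernel name and indexing are illustrative):

    // Every thread of every warp loads the identical 32 bytes (ulong4)
    // from global memory; how many transactions this costs depends on
    // the compute capability, as listed above.
    __global__ void broadcastFromGlobal(const ulong4 *in, ulong4 *out)
    {
        ulong4 v = in[0];  // same global address for all threads
        out[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }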

Compute capability 1.x devices will waste 50% of the transferred data, as a single thread can load at most 16 bytes but the minimum transaction size is 32 bytes. Additionally, 32-byte transactions are a lot slower than 128-byte transactions.

The time will be the same as if each thread read just 8 bytes, because of the minimum transaction size and because the data paths are wide enough to transfer either 8 or 16 bytes to each thread per transaction.

Reading 2× or 4× the data will take 2× or 4× as long on compute capability 1.x, but only minimally longer on 2.0 and higher if the data falls into the same cache line so no further memory transactions are necessary.

So on compute capability 2.0 and higher you don't need to worry. On 1.x, read the data through the constant cache or a texture if it is constant (see the sketch below), or reorder it in shared memory otherwise (assuming your kernel is memory bandwidth bound).
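
A sketch of the constant cache variant suggested above (the symbol and kernel names are made up for illustration):

    #include <cuda_runtime.h>

    // 32 bytes of read-only data served through the constant cache.
    // When all threads of a warp read the same constant address, the
    // value is broadcast to the whole warp - the best case for the
    // constant cache on compute capability 1.x.
    __constant__ ulong4 c_value;

    __global__ void readThroughConstantCache(ulong4 *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = c_value;
    }

    int main()
    {
        ulong4 h_value = make_ulong4(1, 2, 3, 4);
        cudaMemcpyToSymbol(c_value, &h_value, sizeof(h_value));

        ulong4 *d_out;
        cudaMalloc(&d_out, 128 * sizeof(ulong4));
        readThroughConstantCache<<<1, 128>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }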

Upvotes: 1
