Should CUDA shared memory arrays with element types smaller than 4/8 bytes be padded to the bank size manually?

Question

By that I mean: should a __shared__ char a[10] be padded to something like __shared__ char a[10][4] in order to avoid bank conflicts, or will the NVCC compiler take care of this?

nglee · Accepted Answer

The array is not padded by the compiler. You can printf the address of each element and check for yourself.

Manual padding is also unnecessary, because accesses to a small 1-byte-element array like this one do not produce bank conflicts. A bank conflict occurs only when threads access the same bank at different word addresses. When threads access the same bank at the same word address, the hardware's broadcast mechanism handles it, and that case is not a bank conflict.

Reducing bank conflicts is an effective way to reduce execution time, because when a bank conflict occurs the hardware has to access the same bank more than once. However, when different threads access the same word address, the hardware fetches the 4-byte word only once and broadcasts it to those threads, rather than fetching it separately for each thread.

That said, accesses to __shared__ char a[10] cannot cause a bank conflict in any case. Locations [0,1,2,3] all fall into the same word address, because they lie within a single 4-byte word. So yes, they do access the same bank, but since they access the same word address, the hardware simply distributes the data to each thread. The remaining elements fall into the next few consecutive words, each of which sits in a different bank. Note that it is the hardware that handles broadcasting, not the compiler.
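The word and bank reasoning above can be checked with a small host-side model. This is only a sketch under stated assumptions: 32 banks of 4 bytes each, with consecutive words assigned to consecutive banks. The names wordOf, bankOf and byteRangeCanConflict are made up for illustration and are not part of any CUDA API.

```cpp
#include <cstddef>

// Assumed hardware parameters (not queried from a device): 32 banks, 4 bytes wide.
constexpr std::size_t kBankWidthBytes = 4;
constexpr std::size_t kNumBanks = 32;

// Index of the 4-byte word that contains the given byte offset.
constexpr std::size_t wordOf(std::size_t byteOffset) {
    return byteOffset / kBankWidthBytes;
}

// Bank that serves the given byte offset.
constexpr std::size_t bankOf(std::size_t byteOffset) {
    return wordOf(byteOffset) % kNumBanks;
}

// True if any two bytes in [base, base + count) map to the same bank
// but to different words, which is the condition for a bank conflict.
constexpr bool byteRangeCanConflict(std::size_t base, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t j = i + 1; j < count; ++j) {
            if (bankOf(base + i) == bankOf(base + j) &&
                wordOf(base + i) != wordOf(base + j)) {
                return true;
            }
        }
    }
    return false;
}

// a[0..3] share one word, so simultaneous accesses to them are broadcast.
static_assert(wordOf(0) == wordOf(3), "bytes 0..3 lie in the same word");
// No pair of bytes in char a[10] can conflict, even with a misaligned start.
static_assert(!byteRangeCanConflict(0, 10), "aligned char a[10]");
static_assert(!byteRangeCanConflict(3, 10), "misaligned char a[10]");
```

Under the same assumptions, a char array only becomes capable of conflicts once it spans more than 32 words, for example byteRangeCanConflict(0, 129) is true because bytes 0 and 128 share bank 0 but lie in different words.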

It might be helpful to mention that threads access shared memory simultaneously in groups of 32. Such a group of 32 threads is called a warp, and the warp is the unit of execution, so bank conflicts arise only between threads of the same warp.

Consider __shared__ int b[64].

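// Case 1: no bank conflict. Lane i touches word i, so every lane hits a different bank.
{
    int laneIdx = threadIdx.x % 32;  // position of the thread within its warp
    b[laneIdx] = 1;
}

// Case 2: bank conflict. Lane i touches word 2*i, so lanes i and i+16 share a bank.
{
    int laneIdx = threadIdx.x % 32;  // position of the thread within its warp
    int accessIdx = laneIdx * 2;
    b[accessIdx] = 1;
}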

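In case 2, thread n and thread n+16 in a warp (0 <= n < 16) access the same bank at different word addresses, resulting in a two-way bank conflict. With 32 banks, thread n touches word 2n and thread n+16 touches word 2n+32, and both words map to bank 2n.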
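The two cases can also be counted with a host-side model rather than on a GPU. The sketch below assumes 32 banks of one 4-byte word each and a warp of 32 threads, where lane i accesses b[i * stride] of an int array. The function conflictDegree is a hypothetical helper written for this illustration, not a CUDA function.

```cpp
#include <cstddef>

// Assumed hardware parameters (not queried from a device).
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kBanks = 32;

// Largest number of distinct words that a single bank has to serve when
// lane i of a warp accesses b[i * stride] in an int array.
// 1 means conflict-free, 2 means a two-way bank conflict, and so on.
constexpr std::size_t conflictDegree(std::size_t stride) {
    std::size_t wordsPerBank[kBanks] = {};
    std::size_t worst = 0;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        // For 4-byte ints, the element index is also the word index.
        const std::size_t word = lane * stride;

        // A word already requested by an earlier lane is broadcast,
        // so it does not add to the load on its bank.
        bool seen = false;
        for (std::size_t earlier = 0; earlier < lane; ++earlier) {
            if (earlier * stride == word) {
                seen = true;
                break;
            }
        }
        if (seen) {
            continue;
        }

        const std::size_t count = ++wordsPerBank[word % kBanks];
        if (count > worst) {
            worst = count;
        }
    }
    return worst;
}

// Case 1 (b[laneIdx]): every lane hits its own bank.
static_assert(conflictDegree(1) == 1, "stride 1 is conflict-free");
// Case 2 (b[laneIdx * 2]): lanes n and n+16 share a bank.
static_assert(conflictDegree(2) == 2, "stride 2 is a two-way conflict");
```

In this model a stride of 0 also yields 1, since all lanes read the same word and it is broadcast, while a stride of 32 yields 32 because every lane lands on a different word in bank 0.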
