Reputation: 722
I am dealing with a CUDA shared memory access pattern, and I am not sure whether it is fine or incurs some sort of performance penalty.
Suppose I have 512 integer numbers in shared memory:
__shared__ int snums[516];
and half that many threads, that is, 256 threads.
The kernel works as follows: (1) the block of 256 threads first applies a function f(x) to the even locations of snums[], then (2) it applies f(x) to the odd locations of snums[]. Function f(x) acts on the local neighborhood of the given number x, then changes x to a new value. There is a __syncthreads() between (1) and (2).
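Roughly, the kernel body looks like this (f() here is just a stand-in for my actual neighborhood update):
int e = 2 * threadIdx.x;        // (1) even locations: 0, 2, ..., 510
snums[e] = f(snums, e);
__syncthreads();
int o = 2 * threadIdx.x + 1;    // (2) odd locations: 1, 3, ..., 511
snums[o] = f(snums, o);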
Clearly, while doing (1), there are 32-bit gaps in shared memory because the odd locations are not being accessed. The same occurs in (2): there will be gaps at the even locations of snums[].
From what I read in the CUDA documentation, memory bank conflicts should occur when threads access the same locations, but it does not say anything about gaps.
Will there be any problem with the banks that could incur a performance penalty?
Upvotes: 0
Views: 131
Reputation: 152123
I guess you meant:
__shared__ int snums[512];
Will there be any bank conflict and performance penalty?
Assuming at some point your code does something like:
int a = snums[2*threadIdx.x]; // this would access every even location
the above line of code would generate an access pattern with 2-way bank conflicts. A 2-way bank conflict means the above line of code takes approximately twice as long to execute as the optimal no-bank-conflict line of code (depicted below).
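To see why, consider which bank each lane of a warp touches (a quick host-side check, assuming 32 banks in 32-bit bank mode):
#include <cstdio>
int main() {
    for (int t = 0; t < 32; t++)   // the 32 lanes of one warp
        printf("lane %2d -> word %2d -> bank %2d\n", t, 2*t, (2*t) % 32);
    // only the even banks 0, 2, ..., 30 are used, and each one is hit by
    // exactly two lanes (t and t+16), hence 2-way conflicts
    return 0;
}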
If we were to focus only on the above line of code, the obvious approach to eliminating the bank conflict would be to re-order the storage pattern in shared memory so that all of the data items previously stored at snums[0], snums[2], snums[4], ... are now stored at snums[0], snums[1], snums[2], ... thus effectively moving the "even" items to the beginning of the array and the "odd" items to the end of the array. That would allow an access like so:
int a = snums[threadIdx.x]; // no bank conflicts
However, you have stated that a calculation neighborhood is important:
Function f(x) acts on the local neighborhood of the given number x,...
So this sort of reorganization might require some special indexing arithmetic.
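As a sketch of what that arithmetic might look like (assuming N = 512 and a nearest-neighbor, i.e. +/-1, neighborhood, which the question doesn't fully specify):
// old index i maps to i/2 if i is even, or to N/2 + i/2 if i is odd
__device__ int remap(int i) {
    const int N = 512;
    return (i & 1) ? (N/2 + i/2) : (i/2);
}
With that layout, phase (1) reads its even element at snums[threadIdx.x] with no conflicts, and the old neighbors i-1 and i+1 of an even element i = 2*threadIdx.x land at snums[N/2 + threadIdx.x - 1] and snums[N/2 + threadIdx.x].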
On newer architectures, shared memory bank conflicts don't occur when threads access the same location, but they do occur if threads access different locations in the same bank. For a 32-bit data type like int, the bank is simply given by the low-order bits of the word index (bank = index % 32):
snums[0] : bank 0
snums[1] : bank 1
snums[2] : bank 2
...
snums[32] : bank 0
snums[33] : bank 1
...
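Written as a formula (byte_offset here being the offset of the word within shared memory):
int bank = (byte_offset / 4) % 32;   // i.e. word_index % 32 for an int array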
(The above assumes 32-bit bank mode.) This answer may also be of interest.
Upvotes: 2