user3377298

Reputation: 3

How to ensure no bank conflict with 3D shared data access in CUDA

I'm using CUDA to do some operations on several large, three dimensional data sets of the same size, each consisting of floats.

Example below:

out[i+j+k]=in_A[i+j+k]*out[i+j+k]-in_B[i+j+k]*(in_C[i+j+k+1]-in_C[i+j+k]);

where numCols and numDepth refer to the y and z dimensions of the 3D sets (e.g. out, in_A, in_C) and:

int tx=blockIdx.x*blockDim.x + threadIdx.x; int i=tx*numCols*numDepth;

int ty=blockIdx.y*blockDim.y + threadIdx.y; int j=ty*numDepth;

int tz=blockIdx.z*blockDim.z + threadIdx.z; int k=tz;

I've set up my kernel to run on (11,14,4) blocks with (8,8,8) threads in each block. Set up this way, each thread corresponds to one element from each data set. To match this layout, I am using 3D shared memory to reduce redundant global reads for in_C:

(8x8x9 instead of 8x8x8 so that the very edge in_C[i+j+k+1] can be loaded as well)

__shared__ float s_inC[8][8][9];

There are other Stack Exchange posts (example link) and CUDA docs that deal with 2D shared memory and describe what can be done to ensure there are no bank conflicts, such as padding the column dimension by one and accessing the shared array using threadIdx.y then threadIdx.x, but I couldn't find one that describes what happens in the 3D case.

I would imagine that the same rules from the 2D case apply to the 3D case, by thinking of the 3D array as the 2D scheme applied Z times.

So by this thinking, accessing s_inC by:

s_inC[threadIdx.z][threadIdx.y][threadIdx.x]=in_C[i+j+k];

would prevent threads in a warp from accessing the same bank at the same time, and the shared memory should be declared as:

__shared__ float s_inC[8][8+1][9];

(leaving out syncs, boundary checks, inclusion of the very edge case in_C[i+j+k+1], etc).

Are the previous two assumptions correct, and do they prevent bank conflicts?

I'm using Fermi hardware, so there are 32 32-bit shared memory banks.

Upvotes: 0

Views: 233

Answers (1)

Vitality

Reputation: 21495

I think that your conclusions about bank conflict prevention are questionable.

Assuming 8x8x8 thread blocks, an access like

__shared__ int shData[8][8][8];
...
shData[threadIdx.z][threadIdx.y][threadIdx.x] = ...

will give no bank conflict.

By contrast, with 8x8x8 thread blocks, an access like

__shared__ int shData[8][9][9];
...
shData[threadIdx.z][threadIdx.y][threadIdx.x] = ...

will give bank conflicts.

This is illustrated by the figure below, in which the yellow cells indicate threads from the same warp. The figure reports, for each 32-bit bank, the thread accessing it as the tuple (threadIdx.x, threadIdx.y, threadIdx.z). The red cells are the padding cells you are using, which are not accessed by any thread.

[figure: per-bank thread assignment diagram]

Upvotes: 1
