CUDA coalesced access for two-dimensional block

Question

For 1D cases I've pretty much understood the whole coalesced access requirement of global memory in CUDA.

However I'm a bit stuck for two-dimensional case (that is we have a 2D grid, made of 2D blocks).

Suppose I have a vector in_vector and in my kernel I want to access it in a coalesced manner. Like so:

__global__ void my_kernel(float* out_matrix, float* in_vector, int size)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;
   // ...
   float vx = in_vector[i]; // This is good. Here we have coalesced access
   float vy = in_vector[j]; // Not sure about this. All threads in my warp access the same global address. (See explanation)
   // ...
   // Do some calculations... Obtain result
}

In my understanding for this 2D case the threads inside the block are "arranged" in a column-major fashion. Eg: assuming a (threadIdx.x, threadIdx.y) notation:

the first warp would be: (0, 0), (1, 0), (2, 0), ..., (31, 0),
the second warp would be: (0, 1), (1, 1), (2, 1), ..., (31, 1),
and so on...

In this case calling in_vector[i] gives us a coalesced access because each consecutive thread in the same warp will access consecutive addresses. However calling in_vector[j] seems a bad ideea, as each consecutive thread will access the same address in global memory (For example all the threads in warp 0 will access in_vector[0], which would give us 32 different global memory requests)

Did I understood this correctly? If so how can I make a coalesced access to global memory using in_vector[j]?

talonmies · Accepted Answer

What you have shown in your question is only correct for certain block sizes. Your "coalesced" access:

int i = blockIdx.x * blockDim.x + threadIdx.x;
float vx = in_vector[i];

will result in coalesced access of in_vector from global memory only when blockDim.x is greater than or equal to 32. Even in the coalesced case, each thread within a block which shares the same threadIdx.x value reads the same word from global memory, which seems to be counter-intuitive and wasteful.

The correct way to ensure reads are unique per thread and coalesced is to calculate the thread number within the block and an offset within the grid, perhaps something like:

int tid = threadIdx.x + blockDim.x * threadIdx.y; // must use column major order
int bid = blockIdx.x + gridDim.x * blockDim.y; // can either use column or row major
int offset = (blockDim.x * blockDim.y) * bid; // block id * threads per block
float vx = in_vector[tid + offset];

If your intention really isn't to read a unique value per thread, then you can save a lot of memory bandwidth and achieve coalescing using shared memory, something like this:

__shared__ float vx[32], vy[32]; 

int tid = threadIdx.x + blockDim.x * threadIdx.y;

if (tid < 32) {
    vx[tid] = in_vector[blockIdx.x * blockDim.x + tid];
    vy[tid] = in_vector[blockIdx.y * blockDim.y + tid];
}
__syncthread();

and you will get a single warp reading unique values into shared memory once. Other threads can then read values from shared memory without requiring any further global memory access. Note that in the above example I followed the conventions of your code, even if it doesn't necessarily make that much sense to read in_vector twice in that way.

CUDA coalesced access for two-dimensional block

Answers (1)

Related Questions