CUDA coalesced one warp on multiple data

Question

I have a basic question about coalesced memory access in CUDA.

For example, I have an array of 32 elements and 32 threads, and each thread accesses one element.

__global__ void co_acc ( int A[32], int B[32] ) {
    // Global thread index: offset by the block index (blockIdx.x), not by gridDim.x
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    B[inx] = A[inx];
}

Now, what I want to know: if I have 32 threads but an array of 64 elements, each thread has to copy 2 elements. To keep the access coalesced, I should shift the index for the array access by the number of threads I have, e.g. the thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?

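__global__ void co_acc ( int A[64], int B[64] ) {
    // Global thread index: offset by the block index (blockIdx.x), not by gridDim.x
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    // Number of elements each thread copies (2 for 64 elements and 32 threads)
    int actions = 64/blockDim.x;
    for ( int i = 0; i < actions; ++i )
        B[inx+(i*blockDim.x)] = A[inx+(i*blockDim.x)];
}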

user703016 · Accepted Answer

To keep the access coalesced, I should shift the index for the array access by the number of threads I have, e.g. the thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?

Yes, that's a correct approach.

Strictly speaking, it is "could" rather than "should": a memory access is coalesced as long as all threads within a warp request addresses that fall within the same (aligned) 128-byte line. This means you could permute the thread indices within a warp and your accesses would still be coalesced (though there is no reason to complicate what can be kept simple).
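To illustrate the permutation point, here is a minimal sketch in which every lane copies the element of its mirrored lane within the warp. The names co_acc_permuted and copy_permuted are hypothetical and not part of the question; the sketch assumes a block size that is a multiple of the warp size and buffers obtained from cudaMalloc, so that each warp's 32 ints start on an aligned 128-byte line.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: lane k of each warp copies element (31 - k) of that
// warp's 32-element chunk. The warp still touches the same aligned 128-byte
// line, only in a different lane order.
__global__ void co_acc_permuted(const int* A, int* B, int n) {
    int lane = threadIdx.x % warpSize;
    // First element handled by this warp (assumes blockDim.x is a multiple of warpSize)
    int base = blockIdx.x * blockDim.x + (threadIdx.x / warpSize) * warpSize;
    int inx  = base + (warpSize - 1 - lane);
    if (inx < n)
        B[inx] = A[inx];
}

// Hypothetical host helper: copies src through the GPU and returns the result.
// Error checking is omitted for brevity.
std::vector<int> copy_permuted(const std::vector<int>& src) {
    int n = static_cast<int>(src.size());
    std::vector<int> dst(n, 0);
    if (n == 0) return dst;

    int *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMemcpy(dA, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dB, 0, n * sizeof(int));

    int threads = 32;                          // one warp per block, as in the question
    int blocks  = (n + threads - 1) / threads;
    co_acc_permuted<<<blocks, threads>>>(dA, dB, n);

    cudaMemcpy(dst.data(), dB, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    return dst;
}
```

The output is the same as for the straightforward indexing, since only the assignment of elements to lanes changes.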

Another solution would be to have each thread load an int2:

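__global__ void co_acc ( int A[64], int B[64] ) {
    // Global thread index: offset by the block index (blockIdx.x), not by gridDim.x
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);

    // Each thread copies two consecutive ints with a single 8-byte load and store
    reinterpret_cast<int2*>(B)[inx] = reinterpret_cast<int2*>(A)[inx];
}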

This is (in my opinion) simpler and clearer code, and might give marginally better performance as this may reduce the number of instructions emitted by the compiler and the latency between memory requests (disclaimer: I have not tried it).
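For completeness, a hedged sketch of how the int2 variant might be driven from the host. The names co_acc_int2 and copy_int2 are hypothetical. It assumes the element count is even and that the buffers come from cudaMalloc, which returns pointers aligned well beyond the 8 bytes an int2 access requires. With int2, a warp of 32 threads requests 256 contiguous bytes, i.e. two adjacent 128-byte lines.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread copies one int2 (two consecutive ints).
// n_pairs is the number of int2 elements, i.e. half the number of ints.
__global__ void co_acc_int2(const int* A, int* B, int n_pairs) {
    int inx = blockIdx.x * blockDim.x + threadIdx.x;
    if (inx < n_pairs)
        reinterpret_cast<int2*>(B)[inx] = reinterpret_cast<const int2*>(A)[inx];
}

// Hypothetical host helper. Assumption: src.size() is even; for an odd size
// the last element would not be copied and stays 0 in the result.
// Error checking is omitted for brevity.
std::vector<int> copy_int2(const std::vector<int>& src) {
    int n = static_cast<int>(src.size());
    std::vector<int> dst(n, 0);
    int n_pairs = n / 2;
    if (n_pairs == 0) return dst;

    int *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMemcpy(dA, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dB, 0, n * sizeof(int));

    int threads = 32;                                // 32 threads cover 64 ints
    int blocks  = (n_pairs + threads - 1) / threads;
    co_acc_int2<<<blocks, threads>>>(dA, dB, n_pairs);

    cudaMemcpy(dst.data(), dB, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    return dst;
}
```

Whether this is actually faster than the loop version would have to be measured on the target GPU.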

Note: as Robert Crovella has mentioned in his comment, if you really are using thread blocks of 32 threads, then you are likely seriously underusing the capacity of your GPU.
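One common way to use more threads while keeping the same access pattern is a grid-stride loop, sketched below. This is an illustration rather than something proposed in the comment itself: the names co_acc_grid_stride and copy_grid_stride, the block size of 256 and the cap of 1024 blocks are all assumptions chosen for the example. In each iteration, consecutive threads of a warp still read consecutive ints.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical kernel: each thread starts at its global index and advances by
// the total number of threads in the grid, so any n is covered.
__global__ void co_acc_grid_stride(const int* A, int* B, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        B[i] = A[i];
}

// Hypothetical host helper. Error checking is omitted for brevity.
std::vector<int> copy_grid_stride(const std::vector<int>& src) {
    int n = static_cast<int>(src.size());
    std::vector<int> dst(n, 0);
    if (n == 0) return dst;

    int *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMemcpy(dA, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dB, 0, n * sizeof(int));

    int threads = 256;                                        // assumed block size
    int blocks  = std::min((n + threads - 1) / threads, 1024); // assumed cap
    co_acc_grid_stride<<<blocks, threads>>>(dA, dB, n);

    cudaMemcpy(dst.data(), dB, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    return dst;
}
```

For an array of only 64 elements none of this matters much; the launch configuration becomes relevant once the data is large enough to occupy the whole GPU.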
