Bank conflict in CUDA when reading from the same location

Question

I have a CUDA kernel in which, at one point, every thread in a block reads the same value from global memory. Something like:

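__global__ void my_kernel(const float4 * key_pts)
{
    // Every thread in the block reads the same element. A float4 cannot be
    // compared to 0 directly, so one component is tested (.x assumed here).
    if (key_pts[blockIdx.x].x < 0) return;
}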

The kernel is configured as follows:

dim3 blocks(16, 16);   // threads per block
dim3 grid(2000);       // blocks in the grid
my_kernel<<<grid, blocks>>>(key_pts);

My question is whether this will lead to some sort of bank conflict or sub-optimal access in CUDA. I must confess I do not understand this issue in detail yet.

I was thinking I could do something like the following in case we have sub-optimal access:

__global__ void my_kernel(const float4 * key_pts)
{
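    __shared__ float x;
    // Only one thread per block touches global memory
    // (.x component assumed, since a float4 does not convert to float).
    if (threadIdx.x == 0 && threadIdx.y == 0)
        x = key_pts[blockIdx.x].x;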

    __syncthreads();

    if (x < 0) return;
}

Doing some timing, though, I do not see any difference between the two, but so far my tests have used only limited data.

Robert Crovella · Accepted Answer

Bank conflicts apply to shared memory, not global memory.
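To make the distinction concrete, here is a minimal sketch of what a bank conflict looks like in shared memory. It assumes 32 banks that are each 4 bytes wide and a launch with exactly 32 threads (one warp); the kernel name, the helper run_shared_access_patterns, and the array sizes are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: three shared-memory read patterns within one warp.
// Assumes 32 banks of 4-byte words and blockDim.x == 32.
__global__ void shared_access_patterns(float *out)
{
    __shared__ float tile[32 * 32];
    const int t = threadIdx.x;

    // Fill the tile so that tile[i] == i.
    for (int i = t; i < 32 * 32; i += 32)
        tile[i] = static_cast<float>(i);
    __syncthreads();

    float same    = tile[0];       // same word for every thread: broadcast, no conflict
    float linear  = tile[t];       // consecutive words: 32 different banks, no conflict
    float strided = tile[t * 32];  // stride of 32 words: every thread hits bank 0 (32-way conflict)

    out[t] = same + linear + strided;  // 0 + t + 32 * t
}

// Hypothetical host helper; host_out must hold 32 floats.
// Error checking is omitted to keep the sketch short.
void run_shared_access_patterns(float *host_out)
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(float));
    shared_access_patterns<<<1, 32>>>(d_out);
    cudaMemcpy(host_out, d_out, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

Only the strided read is serialized under these assumptions. The same-word read, which is the pattern your shared-memory version uses when every thread reads x, is a broadcast and causes no conflict.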

Since all threads ultimately need the same value to make their decision, this won't yield sub-optimal access to global memory. There is a broadcast mechanism: when all threads in the same warp request the same location from global memory, they all receive the value without any serialization or overhead. All threads in the warp can be serviced at the same time:

"Note that threads can access any words in any order, including the same words."

Furthermore, assuming your GPU has a cache (compute capability 2.0 or newer), the value retrieved from global memory by the first warp to reach this point will likely be available in the cache for subsequent warps that reach it.

I wouldn't expect much performance difference between the two cases.
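If you want to check this on your own hardware, a rough sketch of a comparison harness using CUDA events could look like the following. The survivors counter, the Result struct, and the run_variant helper are hypothetical additions so that the kernels produce something observable. Testing the .x component is also an assumption, since the question does not say which part of the float4 matters.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Variant 1: every thread reads the same global word directly.
__global__ void direct_read(const float4 *key_pts, int *survivors)
{
    if (key_pts[blockIdx.x].x < 0) return;   // .x component is an assumption
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(survivors, 1);             // count blocks that did not exit early
}

// Variant 2: one thread stages the value through shared memory.
__global__ void staged_read(const float4 *key_pts, int *survivors)
{
    __shared__ float x;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        x = key_pts[blockIdx.x].x;
    __syncthreads();
    if (x < 0) return;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(survivors, 1);
}

struct Result { float ms; int survivors; };

// Hypothetical helper: launches one block per key point with 16x16 threads,
// times the launch with CUDA events, and returns the number of surviving blocks.
// Error checking is omitted to keep the sketch short.
Result run_variant(void (*kernel)(const float4 *, int *),
                   const std::vector<float4> &pts)
{
    float4 *d_pts = nullptr;
    int *d_count = nullptr;
    cudaMalloc(&d_pts, pts.size() * sizeof(float4));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_pts, pts.data(), pts.size() * sizeof(float4),
               cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 blocks(16, 16);
    dim3 grid(static_cast<unsigned int>(pts.size()));
    cudaEventRecord(start);
    kernel<<<grid, blocks>>>(d_pts, d_count);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    Result r = {0.0f, 0};
    cudaEventElapsedTime(&r.ms, start, stop);
    cudaMemcpy(&r.survivors, d_count, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_pts);
    cudaFree(d_count);
    return r;
}
```

Calling run_variant(direct_read, pts) and run_variant(staged_read, pts) on the same input should give identical survivor counts, and the ms fields can then be compared. The first launch in a process carries one-time startup cost, so it is worth doing a warm-up call and averaging several runs before drawing conclusions.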
