Reputation: 10996
I have a CUDA kernel where there is a point where each thread is reading the same value from the global memory. So something like:
__global__ void my_kernel(const float4 * key_pts)
{
if (key_pts[blockIdx.x] < 0 return;
}
The kernel is configured as follows:
dim3 blocks(16, 16);
dim3 grid(2000);
my_kernel<<<grid, blocks, 0, stream>>>(key_pts);
My question is whether this will lead to some sort bank conflict or sub-optimal access in CUDA. I must confess I do not understand this issue in detail yet.
I was thinking I could do something like the following in case we have sub-optimal access:
__global__ void my_kernel(const float4 * key_pts)
{
__shared__ float x;
if (threadIdx.x == 0 && threadIdx.y == 0)
x = key_pts[blockIdx.x];
__syncthreads();
if (x < 0) return;
}
Doing some timing though, I do not see any difference between the two but so far my tests are with limited data.
Upvotes: 0
Views: 223
Reputation: 152063
bank conflicts apply to shared memory, not global memory.
Since all threads need (ultimately) the same value to make their decision, this won't yield sub-optimal access on global memory because there is a broadcast mechanism so that all threads in the same warp, requesting the same location/value from global memory, will retrieve that without any serialization or overhead. All threads in the warp can be serviced at the same time:
Note that threads can access any words in any order, including the same words.
Furthermore, assuming your GPU has a cache (cc2.0 or newer) the value retrieved from global memory for the first warp encountering this will likely be available in the cache for subsequent warps that hit this point.
I wouldn't expect much performance difference between the two cases.
Upvotes: 3