Reputation: 375
Please understand me, but I don't know English.
My Computing environment is
I read already the documents about "coalesced memory access" on CUDA C programming guide. But I can't apply them in my case.
I've 32x32 blocks/grid and 16x16 threads/block. That means as following code.
dim3 grid(32, 32);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
Then, How can I use that coalesced memory access?
I used code in below kernel.
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i*512+j] = ...;
I used the constant 512 because total amount of threads is 512x512 threads:It is grid_size x block_size.
But, I saw "Low Global Memory Store Efficiency[9.7% avg, for kernels accounting for 100% of compute]" from Visual Profiler.
Helper says using the coalesced memory access. But, I cannot know what should I use the index context of the memory.
For more information for detail code, The result of an experiment different from CUDA Occupancy Calculator
Upvotes: 0
Views: 1114
Reputation: 72349
Coalescing memory loads and stores in CUDA is a pretty straightforward concept - threads in the same warp need to load or store from/into suitably aligned, consecutive words in memory.
The warp size is 32 in CUDA, and warps are formed from threads within the same block, ordered so that the x dimension of threadIdx.{xyz}
varies the fastest, the y the next fastest, and the z the slowest (functionally this is the same as column major ordering in arrays).
The code you have posted isn't achieving coalesced memory stores because threads within the same warp are storing with a pitch of 512 words, not within the required 32 consecutive words.
A simple hack to improve coalescing would be to address the memory in column major order, so:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i+512*j] = ...;
A more general approach on a 2D block and grid to achieve coalescing in the spirit of what you showed in the question would be like this:
tid_in_block = threadIdx.x + threadIdx.y * blockDim.x;
bid_in_grid = blockIdx.x + blockIdx.y * gridDim.x;
threads_per_block = blockDim.x * blockDim.y;
tid_in_grid = tid_in_block + thread_per_block * bid_in_grid;
global_memory[tid_in_grid] = ...;
The most appropriate solution will depend on other details of the code and data which you have not described.
Upvotes: 2