Reputation: 13195
I have an image processing kernel that uses a buffer of flags that is too large to fit into local memory. The flags are accessed in a predictable raster pattern (upper left to lower right).
My idea is to store the flags in global memory and use local memory as a cache for them. So, as I progress along the raster pattern, I want to read flags from global to local memory, do some processing, then write the flags back to global memory. But I want to hide the latency involved.
So, suppose I access my image as a series of locations: a1, a2, a3, ...
I want to do the following:
    fetch a1 flags
    fetch a2 flags   -- while a2 flags are being fetched, process the a1 location and store it back to global memory
    fetch a3 flags   -- while a3 flags are being fetched, process the a2 location and store it back to global memory
    ...

How should I structure my code to ensure that the latency is hidden?
Do I need to use vload/vstore to do this, or will the GPU hardware do the latency hiding automatically?
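Roughly, this is the structure I have in mind, sketched below in CUDA terms with shared memory standing in for the local-memory cache (TILE, the indexing, and the per-location "processing" are placeholders, not my actual kernel):

    #define TILE 256   // placeholder: number of flags handled per block per step

    // Double-buffered walk over a block's slice of the flag buffer:
    // while tile i is being processed, tile i+1 is already being fetched.
    __global__ void rasterWalk(unsigned int *flags, int tilesPerBlock) {
        __shared__ unsigned int buf[2][TILE];

        const int t    = threadIdx.x;                        // launch with blockDim.x == TILE
        const int base = blockIdx.x * tilesPerBlock * TILE;  // this block's slice of the buffer
        int cur = 0;

        // Prefetch the first tile (a1).
        buf[cur][t] = flags[base + t];
        __syncthreads();

        for (int tile = 0; tile < tilesPerBlock; ++tile) {
            const int nxt = 1 - cur;

            // Start loading the next tile (a2, a3, ...) before processing,
            // so its global-memory latency overlaps the work below.
            if (tile + 1 < tilesPerBlock)
                buf[nxt][t] = flags[base + (tile + 1) * TILE + t];

            // "Process a_i" -- placeholder for the real per-location work.
            unsigned int f = buf[cur][t] | 1u;

            // Store the processed flags back to global memory.
            flags[base + tile * TILE + t] = f;

            __syncthreads();   // make sure the next tile has fully arrived
            cur = nxt;
        }
    }

Whether I have to arrange the overlap explicitly like this, or whether the hardware does it for me, is exactly what I am unsure about.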
Upvotes: 0
Views: 727
Reputation: 853
It looks like you have all the requirements for a highly efficient kernel: you can predict the data access pattern, you get coalesced reads from global memory, and you have complex processing.
The hardware will hide the global load latency automatically by picking warps from resident blocks that are "in flight" to execute while a stalled warp waits for its loads, but there must be plenty of eligible warps. I think you may be dealing with one of two challenges here, and it is very hard to tell which one you have, but some experiments can help. The Achieved Occupancy, Instruction Statistics, Branch Statistics and Issue Efficiency profiler metrics should help you pinpoint the kernel's limitation. You may even run into a processor pipeline limitation.
Please be aware that "local" memory is off-chip (like global memory), though some hardware allows it to be cached in L1. It also looks like you could use shared memory to improve your processing.
So basically, as long as you have more than 80% eligible warps, you should not have problems hiding the latency.
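As a rough way to check that, here is a minimal host-side sketch using the CUDA occupancy API (myKernel and the block size of 256 are hypothetical placeholders for your kernel and launch configuration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for the image-processing kernel.
    __global__ void myKernel(const int *flags, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = flags[i] + 1;
    }

    int main() {
        int blockSize = 256;        // threads per block you plan to launch
        int maxBlocksPerSM = 0;

        // Ask the runtime how many blocks of this kernel fit on one SM,
        // given its register and shared-memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxBlocksPerSM, myKernel, blockSize, /*dynamicSMemBytes=*/0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        float occupancy = (float)(maxBlocksPerSM * blockSize) /
                          prop.maxThreadsPerMultiProcessor;
        printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
        return 0;
    }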
Upvotes: 1
Reputation: 6343
The key is to make sure your reads are coalesced; that is the only way to get peak memory bandwidth. Then keep the kernel's complexity low enough that occupancy stays high enough to hide all compute behind memory access, and you will be running as fast as possible.
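For reference, a minimal sketch of what a coalesced raster-order read looks like (the kernel name, flag type and per-pixel work are assumptions, not the asker's code):

    // Coalesced raster-order access: consecutive threads in a warp read
    // consecutive addresses, so each warp's loads combine into a few transactions.
    __global__ void rasterKernel(const unsigned char *flags,
                                 unsigned char *out,
                                 int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx = y * width + x;        // row-major (raster) layout
        unsigned char f = flags[idx];   // threads 0..31 of a warp hit idx..idx+31

        // ... per-pixel processing using the flag (placeholder) ...
        out[idx] = f;
    }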
Upvotes: 1
Reputation: 2916
The CUDA surface concept might be a good tool for your case. Its access pattern is optimized for image processing, and it goes through the texture cache, so there is no need to do the caching yourself. The texture cache is local to each multiprocessor, so you may want to use a 2D thread distribution so that each block processes a small square of the image.
Hiding latency is naturally achieved by scheduling more threads and blocks than the hardware can process simultaneously. Depending on the compute capability of the device, the ratio between the maximum number of resident threads per multiprocessor (2048 since CC 3.0) and the number of CUDA cores per SM gives you a good hint for the total number of threads (threads per block × number of blocks) you want to schedule to best hide latency. Note that the optimum actually depends on the code itself, the number of registers your kernel needs, etc.
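A minimal sketch of that approach, assuming 8-bit flags in a 1024x1024 image (sizes, names and the per-pixel work are placeholders, not the answerer's code):

    #include <cuda_runtime.h>

    // Read and write the flags through a surface object; 2D-local accesses
    // go through the texture cache, so no manual caching is needed.
    __global__ void processFlags(cudaSurfaceObject_t flagsSurf, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        unsigned char f;
        surf2Dread(&f, flagsSurf, x * (int)sizeof(unsigned char), y);  // x offset is in bytes
        // ... process the location using f (placeholder) ...
        surf2Dwrite(f, flagsSurf, x * (int)sizeof(unsigned char), y);
    }

    int main() {
        const int width = 1024, height = 1024;   // assumed image size

        // Back the surface with a CUDA array that allows surface load/store.
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
        cudaArray_t arr;
        cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);

        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeArray;
        resDesc.res.array.array = arr;

        cudaSurfaceObject_t flagsSurf = 0;
        cudaCreateSurfaceObject(&flagsSurf, &resDesc);

        dim3 block(16, 16);   // 2D blocks -> small square tiles per block
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        processFlags<<<grid, block>>>(flagsSurf, width, height);
        cudaDeviceSynchronize();

        cudaDestroySurfaceObject(flagsSurf);
        cudaFreeArray(arr);
        return 0;
    }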
Upvotes: 3
Reputation: 8410
There is no need to do this manually. GPU devices already do this for you.
The compute core executes the work-items in batches (warps), and when a batch cannot continue because it is waiting for global memory, the core puts that batch to sleep and launches another batch in the meantime.
Upvotes: 2