Reputation: 13195
I have an image processing kernel that uses a buffer of flags that is too large to fit into local memory. The flags are accessed in a predictable raster pattern (upper left to lower right).
My idea is to store the flags in global memory and use local memory as a cache for them. So, as I progress along the raster pattern, I want to read flags from global to local memory, do some processing, then write the flags back to global memory. But I want to hide the latency involved.
So, suppose I access my image as a series of locations: a1, a2, a3, ...
I want to do the following:
    fetch a1 flags
    fetch a2 flags   -- while a2 flags are being fetched, process the a1 location and store it back to global memory
    fetch a3 flags   -- while a3 flags are being fetched, process the a2 location and store it back to global memory
    ...

How should I structure my code to ensure that the latency is hidden?
Do I need to use vload/vstore to do this, or will the GPU hardware do the latency hiding automatically?
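Roughly, this is the structure I have in mind, sketched below in CUDA terms with shared memory standing in for the local-memory cache (TILE, the indexing, and the per-location "processing" are placeholders, not my actual kernel):

    #define TILE 256   // placeholder: number of flags handled per block per step

    // Double-buffered walk over a block's slice of the flag buffer:
    // while tile i is being processed, tile i+1 is already being fetched.
    __global__ void rasterWalk(unsigned int *flags, int tilesPerBlock) {
        __shared__ unsigned int buf[2][TILE];

        const int t    = threadIdx.x;                        // launch with blockDim.x == TILE
        const int base = blockIdx.x * tilesPerBlock * TILE;  // this block's slice of the buffer
        int cur = 0;

        // Prefetch the first tile (a1).
        buf[cur][t] = flags[base + t];
        __syncthreads();

        for (int tile = 0; tile < tilesPerBlock; ++tile) {
            const int nxt = 1 - cur;

            // Start loading the next tile (a2, a3, ...) before processing,
            // so its global-memory latency overlaps the work below.
            if (tile + 1 < tilesPerBlock)
                buf[nxt][t] = flags[base + (tile + 1) * TILE + t];

            // "Process a_i" -- placeholder for the real per-location work.
            unsigned int f = buf[cur][t] | 1u;

            // Store the processed flags back to global memory.
            flags[base + tile * TILE + t] = f;

            __syncthreads();   // make sure the next tile has fully arrived
            cur = nxt;
        }
    }

Whether I have to arrange the overlap explicitly like this, or whether the hardware does it for me, is exactly what I am unsure about.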
Upvotes: 0
Views: 727
Reputation: 853
It looks like you have all the requirements for a highly efficient kernel: you can predict the data access pattern, you get coalesced reads from global memory, and you have complex processing.
The hardware will hide the global load latency automatically by picking warps from resident blocks that are "in flight" to execute while a stalled warp waits for its loads, but there must be plenty of eligible warps. I think you may be dealing with one of two challenges here, and it is very hard to tell which one you have, but some experiments can help. The Achieved Occupancy, Instruction Statistics, Branch Statistics and Issue Efficiency profiler metrics should help you pinpoint the kernel's limitation. You may even run into a processor pipeline limitation.
Please be aware that "local" memory is off-chip (like global memory), though some hardware allows it to be cached in L1. It also looks like you could use shared memory to improve your processing.
So basically, as long as you have more than 80% eligible warps, you should not have problems hiding the latency.
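As a rough way to check that, here is a minimal host-side sketch using the CUDA occupancy API (myKernel and the block size of 256 are hypothetical placeholders for your kernel and launch configuration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for the image-processing kernel.
    __global__ void myKernel(const int *flags, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = flags[i] + 1;
    }

    int main() {
        int blockSize = 256;        // threads per block you plan to launch
        int maxBlocksPerSM = 0;

        // Ask the runtime how many blocks of this kernel fit on one SM,
        // given its register and shared-memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxBlocksPerSM, myKernel, blockSize, /*dynamicSMemBytes=*/0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        float occupancy = (float)(maxBlocksPerSM * blockSize) /
                          prop.maxThreadsPerMultiProcessor;
        printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
        return 0;
    }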
Upvotes: 1
Reputation: 6343
The key is to make sure your reads are coalesced; that is the only way to get peak memory bandwidth. Then keep the kernel's complexity low enough that occupancy stays high enough to hide all compute behind memory access, and you will be running as fast as possible.
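For reference, a minimal sketch of what a coalesced raster-order read looks like (the kernel name, flag type and per-pixel work are assumptions, not the asker's code):

    // Coalesced raster-order access: consecutive threads in a warp read
    // consecutive addresses, so each warp's loads combine into a few transactions.
    __global__ void rasterKernel(const unsigned char *flags,
                                 unsigned char *out,
                                 int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx = y * width + x;        // row-major (raster) layout
        unsigned char f = flags[idx];   // threads 0..31 of a warp hit idx..idx+31

        // ... per-pixel processing using the flag (placeholder) ...
        out[idx] = f;
    }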
Upvotes: 1
Reputation: 2916
The CUDA surface concept might be a good tool for your case. Its access pattern is optimized for image processing, and it goes through the texture cache, so there is no need to do the caching yourself. The texture cache is local to each multiprocessor, so you may want to use a 2D thread distribution so that each block processes a small square of the image.
Hiding latency is naturally achieved by scheduling more threads and blocks than the hardware can process simultaneously. Depending on the compute capability of the device, the ratio between the maximum number of resident threads per multiprocessor (2048 since CC 3.0) and the number of CUDA cores per SM gives you a good hint for the total number of threads (threads per block × number of blocks) you want to schedule to best hide latency. Note that the optimum actually depends on the code itself, the number of registers your kernel needs, etc.
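A minimal sketch of that approach, assuming 8-bit flags in a 1024x1024 image (sizes, names and the per-pixel work are placeholders, not the answerer's code):

    #include <cuda_runtime.h>

    // Read and write the flags through a surface object; 2D-local accesses
    // go through the texture cache, so no manual caching is needed.
    __global__ void processFlags(cudaSurfaceObject_t flagsSurf, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        unsigned char f;
        surf2Dread(&f, flagsSurf, x * (int)sizeof(unsigned char), y);  // x offset is in bytes
        // ... process the location using f (placeholder) ...
        surf2Dwrite(f, flagsSurf, x * (int)sizeof(unsigned char), y);
    }

    int main() {
        const int width = 1024, height = 1024;   // assumed image size

        // Back the surface with a CUDA array that allows surface load/store.
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
        cudaArray_t arr;
        cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);

        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeArray;
        resDesc.res.array.array = arr;

        cudaSurfaceObject_t flagsSurf = 0;
        cudaCreateSurfaceObject(&flagsSurf, &resDesc);

        dim3 block(16, 16);   // 2D blocks -> small square tiles per block
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        processFlags<<<grid, block>>>(flagsSurf, width, height);
        cudaDeviceSynchronize();

        cudaDestroySurfaceObject(flagsSurf);
        cudaFreeArray(arr);
        return 0;
    }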
Upvotes: 3
Reputation: 8410
There is no need to do this manually. GPU devices already do this for you.
The compute core executes the work-items in batches (warps), and when a batch cannot continue because it is waiting for global memory, the core puts that batch to sleep and launches another batch in the meantime.
Upvotes: 2