user3329035

Reputation: 83

Share large constant data among CUDA threads

I have a kernel which is called multiple times. In each call, constant data of around 240 KB is shared and processed by the threads. The threads work independently, like a map function. The stall time of the threads is considerable; the reason behind that could be bank conflicts in memory reads. How can I handle this? (I have a GTX 1080 Ti.)

Can "const global" of opencl handle this? (because constant memory in cuda is limited to 64 kb)

Upvotes: 2

Views: 1205

Answers (2)

Robert Crovella

Reputation: 151993

In CUDA, I believe the best recommendation would be to make use of the so-called "read-only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:

  1. It is not limited to 64KB the way __constant__ memory is.
  2. It does not expect or require "uniform access", as the constant cache does, to deliver full access bandwidth/best performance. "Uniform access" refers to all threads in a warp accessing the same location or the same constant memory value (per read cycle/instruction).

The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate the pointers you pass to the CUDA kernel with __restrict__ (assuming you are not aliasing between pointers), and to decorate the pointer that refers to the large constant data with const ... __restrict__. This will allow the compiler to generate the appropriate LDG instructions for access to the constant data, pulling it through the read-only cache mechanism.
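For example, a minimal sketch of that decoration (the kernel name, the table length, and the access pattern are illustrative assumptions, not taken from your question):

    // Each thread maps one input element against the shared ~240KB table.
    // The const ... __restrict__ decoration on the table pointer lets the
    // compiler emit LDG instructions that pull it through the read-only cache.
    __global__ void mapKernel(const float* __restrict__ table,  // large read-only data
                              const float* __restrict__ in,
                              float* __restrict__ out,
                              int n, int tableLen)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int idx = i % tableLen;          // illustrative, non-uniform access
            out[i] = in[i] * table[idx];     // read serviced via the read-only cache
        }
    }

Where the compiler cannot be convinced that a particular load is safe to route this way, the __ldg() intrinsic can be applied to individual reads (e.g. __ldg(&table[idx])) to request the same path explicitly.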

This read-only cache mechanism is only supported on GPUs of cc 3.5 or higher, but that covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 Ti), Volta, and Turing generations.

If you have a GPU of compute capability less than 3.5, possibly the best suggestion for similar benefits (larger than __constant__ memory, not needing uniform access) would be to use texture memory. This is also documented elsewhere in the programming guide; there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions here on the SO cuda tag covering it as well.
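As a rough illustration of the texture object route (available on cc 3.0 and newer; older devices would use the legacy texture reference API instead), assuming a flat device array d_table of N floats, where both names are my own placeholders:

    #include <cuda_runtime.h>

    // Host-side sketch: wrap the device array in a texture object so that
    // kernel reads of the table go through the texture cache.
    // Error checking omitted for brevity.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_table;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = N * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    // Inside the kernel, element i is then read as:
    //   float v = tex1Dfetch<float>(tex, i);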

Upvotes: 8

pmdj

Reputation: 23438

Constant memory that doesn't fit in the hardware's constant buffer will typically "spill" into global memory in OpenCL. Bank conflicts are usually an issue with local memory, however, so that's probably not your problem. I'm assuming CUDA's 64KiB constant limit reflects NVIDIA's hardware, so OpenCL isn't going to magically perform better here.

However, reading global memory without a predictable pattern can of course be slow, especially if you don't have sufficient thread occupancy and arithmetic work to mask the memory latency.

Without knowing anything further about your problem space, here are the directions you could take for further optimisation, assuming your global memory reads are the issue:

  • Reduce the amount of constant/global data you need, for example by using more efficient types, other compression mechanisms, or computing some of the values on the fly (possibly storing them in local memory for all threads in a group to share).
  • Cluster the most frequently used data in a small constant buffer, and explicitly place the more rarely used constants in a global buffer. This may help the runtime lay it out more efficiently in the hardware. If that doesn't help, try to copy the frequently used data into local memory (see the sketch after this list), and make your thread groups large and comparatively long-running to hide the copying cost.
  • Check whether thread occupancy could be improved. It usually can, and this tends to give you substantial performance improvements in almost any situation (unless your code is already extremely ALU/FPU bound).
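For the local-memory staging idea above, here is a minimal CUDA sketch (OpenCL local memory corresponds to CUDA __shared__ memory; the kernel name, HOT_SIZE, and the access pattern are assumptions for illustration):

    constexpr int HOT_SIZE = 1024;  // assumed size of the "hot" subset, in floats (4KB)

    __global__ void mapWithStaging(const float* __restrict__ hotTable,
                                   const float* __restrict__ in,
                                   float* __restrict__ out, int n)
    {
        __shared__ float hot[HOT_SIZE];

        // Cooperative copy: the threads of a block stage the hot data once,
        // then every thread reads it from fast on-chip memory.
        for (int j = threadIdx.x; j < HOT_SIZE; j += blockDim.x)
            hot[j] = hotTable[j];
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * hot[i % HOT_SIZE];  // illustrative use of the staged data
    }

The larger the block and the more work each thread does, the better the one-time copy cost is amortised.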

Upvotes: 1
