Reputation: 83
I have a kernel which is called multiple times. In each call, constant data of around 240 KB is shared and processed by the threads. The threads work independently, like a map function. The stall time of the threads is considerable; the reason may be bank conflicts on memory reads. How can I handle this? (I have a GTX 1080 Ti.)
Can OpenCL's "const global" handle this? (Constant memory in CUDA is limited to 64 KB.)
Upvotes: 2
Views: 1205
Reputation: 151993
In CUDA, I believe the best recommendation would be to make use of the so-called "read-only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:

- It is not limited to 64 KB the way __constant__ memory is.
- It does not require uniform access across a warp for good performance, the way the constant cache does.

The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate the pointers passed to your CUDA kernel with __restrict__ (assuming you are not aliasing between pointers), and to decorate the pointer that refers to the large constant data with const ... __restrict__. This allows the compiler to generate the appropriate LDG instructions for accesses to the constant data, pulling them through the read-only cache mechanism.
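A minimal sketch of that decoration (the kernel and buffer names here are hypothetical, not from the question):

```cuda
// Each thread performs an independent "map" over the input, reading a
// large (~240 KB) read-only table. Marking the table pointer
// const ... __restrict__ lets the compiler route its loads through the
// read-only cache (LDG instructions) on cc 3.5+ devices.
__global__ void mapKernel(const float* __restrict__ table,  // large constant data
                          const float* __restrict__ in,
                          float* __restrict__ out,
                          int n, int tableLen)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int idx = i % tableLen;        // illustrative lookup pattern
        out[i] = in[i] * table[idx];
    }
}
```

Where the compiler cannot prove a load is eligible, the `__ldg()` intrinsic (cc 3.5+) can be used to force an individual read through the read-only cache.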
This read-only cache mechanism is only supported on GPUs of cc 3.5 or higher, but that covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 Ti), Volta, and Turing generations.
If you have a GPU of compute capability lower than 3.5, the best suggestion for similar benefits (larger capacity than __constant__ memory, no need for uniform access) would be to use texture memory. This is also documented elsewhere in the programming guide; there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions under the SO cuda tag covering it as well.
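As a sketch, using the texture object API (available from cc 3.0; function and variable names are assumptions, and older devices would need the legacy texture reference API instead) — reads via `tex1Dfetch` are pulled through the texture cache:

```cuda
// Bind an existing linear device buffer to a texture object so that
// kernel reads of the constant data go through the texture cache.
cudaTextureObject_t makeTex(const float* devBuf, size_t bytes)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = const_cast<float*>(devBuf);
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}

__global__ void mapKernel(cudaTextureObject_t tex, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch<float>(tex, i);  // cached read-only fetch
}
```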
Upvotes: 8
Reputation: 23438
Constant memory that doesn't fit in the hardware's constant buffer will typically "spill" into global memory on OpenCL. Bank conflicts are usually an issue with local memory, however, so that's probably not it. I'm assuming CUDA's 64 KiB constant limit reflects NVIDIA's hardware, so OpenCL isn't going to magically perform better here.
Reading global memory without a predictable pattern can of course be slow, especially if you don't have sufficient thread occupancy and arithmetic to mask the memory latency.
Without knowing anything further about your problem space, this also brings me to the directions in which you could take further optimisations, assuming your global memory reads are the issue:
- Copy the data into local memory for all threads in a group to share.
- Place the most frequently used constants in a constant buffer, and explicitly place the more rarely used constants in a global buffer. This may help the runtime lay it out more efficiently in the hardware. If that doesn't help, try to copy the frequently used data into local memory, and make your thread groups large and comparatively long-running to hide the copying hit.

Upvotes: 1
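The constant/global split suggested in this answer could look like the following OpenCL kernel signature (the names and the lookup expression are illustrative only):

```c
// Hot, frequently-read parameters go in the __constant address space,
// which the runtime can map onto the hardware constant buffer; the bulky,
// rarely-used table stays in __global memory.
__kernel void process(__constant float* hot_params,
                      __global const float* cold_table,
                      __global const float* in,
                      __global float* out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * hot_params[0] + cold_table[i & 1023];
}
```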