Reputation: 1796
I've always been wondering about the caching behaviour of global data in OpenCL.
Lets say I have a pointer to global memory in a kernel. Now I read the location the pointer points to. Later in the kernel I might need the same data again, so I read it again through the pointer.
Now the question is, will this data be cached, or will it be reread from global memory every single time because other threads could have modified it? If it's not cached, then I'd have to make a local copy every time so I don't lose tons of performance by constantly accessing global memory.
I know this might be vendor specific, but what do the specs say about this?
Upvotes: 2
Views: 1993
Reputation: 166
Those who want benefits of local cashing for work-items within a work-group, for example on FPGAs, can read this paper by Andrew Ling at IWOCL2017 https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf. It is a good example of having correct usage of local caches and clever communication for dataflow computing. Those who want convenience of cache in parallel peer-to-peer setting and still have hardware do this for them should consider POWER8 or POWER9 chips. These conveniences come at the cost: caching global or virtual memory cluster interconnect may have to have several TBs of bandwidth. Real question is: What is the value of caching for dataflow compute e.g. ML, especially on clusters, vs. reducing communication and increasing data reuse by other means.
Upvotes: 0
Reputation: 6333
There is some caching but the key to great GPU compute performance it is move "accessed many time" data to private or shared local memory and not re-read it. In a way, you can think of this as "you control the caching". In OpenCL this would be done in your kernel (in parallel!) and then you'd have a memory barrier (to ensure all work items have finished the copy) then your algorithm has access to the data in fast memory. See the matrix multiply example (since each column and row contributes to multiple output values, copying them to shared local memory accelerates the algorithm.
Upvotes: 3