Reputation: 13195
I am thinking of converting my kernel from buffer to 2d image. Suppose 16 threads in a workgroup access 16 consecutive pixels in one row of an image. Is this access coalesced?
Also, what is the best access pattern for reading an (n x m) rectangular strip, where m is 8 or 16?
Upvotes: 1
Views: 407
Reputation: 6333
On a GPU, OpenCL images are read through the texture cache. The details are implementation-dependent and not usually documented, but images are typically stored in tiles for locality of reference. So if adjacent work items access nearby pixels, there is a good chance the read will be fast.
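To make the locality argument concrete, here is a small CPU-side C sketch of a toy model. Everything in it is an assumption for illustration: the 4x4 tile size, the 16-pixel cache line, and the helper names `linear_index`, `tiled_index`, and `lines_touched` are hypothetical, since real GPUs generally do not document their texture layout. It counts how many cache lines a block of pixels touches under a row-major layout versus a tiled one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameters, for illustration only: 4x4 tiles and
   16 pixels per cache line. Real hardware values are undocumented. */
enum { TILE = 4, LINE_PIXELS = 16, MAX_PIXELS = 64 };

/* Row-major (buffer-like) pixel index. */
static size_t linear_index(size_t x, size_t y, size_t width) {
    return y * width + x;
}

/* Tiled index: tiles are stored row-major, and pixels inside each
   tile are stored row-major as well. */
static size_t tiled_index(size_t x, size_t y, size_t width) {
    assert(width % TILE == 0);
    size_t tiles_per_row = width / TILE;
    size_t tile = (y / TILE) * tiles_per_row + (x / TILE);
    size_t within = (y % TILE) * TILE + (x % TILE);
    return tile * TILE * TILE + within;
}

/* Count the distinct cache lines touched by the w x h block of pixels
   whose top-left corner is (x0, y0). */
static size_t lines_touched(size_t x0, size_t y0, size_t w, size_t h,
                            size_t width, int tiled) {
    size_t seen[MAX_PIXELS];
    size_t count = 0;
    assert(w * h <= MAX_PIXELS);
    for (size_t y = y0; y < y0 + h; ++y) {
        for (size_t x = x0; x < x0 + w; ++x) {
            size_t idx = tiled ? tiled_index(x, y, width)
                               : linear_index(x, y, width);
            size_t line = idx / LINE_PIXELS;
            int found = 0;
            for (size_t i = 0; i < count; ++i) {
                if (seen[i] == line) { found = 1; break; }
            }
            if (!found) seen[count++] = line;
        }
    }
    return count;
}
```

In this toy model with a 64-pixel-wide image, a 16x1 row touches 1 line when linear and 4 when tiled, while a 1x16 column touches 16 lines when linear and 4 when tiled. A 4x4 block touches a single line when tiled. The numbers say nothing about any particular GPU; they only show why a tiled layout makes the cost of row and column access similar.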
Because image reads go through the texture cache, the term "coalesced" does not really apply to them; it only applies to buffer reads.
Compared to coalesced buffer reads, images may be slightly slower; however, compared to un-coalesced buffer reads but with some amount of locality, they can be faster.
A good example is a Gaussian blur decomposed into a vertical pass and a horizontal pass. With buffers, the vertical pass done in columns gets coalesced reads, but the horizontal pass does not, so it is very slow. So much so that all of the examples add a transpose step that uses shared local memory with coalesced reads and writes, so that the vertical-pass kernel can be reused for the horizontal pass, followed by a transpose back. All well and good, but with images you can skip the transpose, because the vertical and horizontal passes run at the same speed (slightly slower than the coalesced buffer reads, but far faster than the uncoalesced ones). Overall it is faster because you skip the two transpose kernels.
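As a rough sketch of the "no transpose" structure, the per-pixel logic of a separable blur can be written once and pointed in either direction. The C code below is a CPU-side stand-in, not an OpenCL kernel: `blur_pass`, `sample`, and `clampi` are hypothetical names, and `sample` plays the role that `read_imagef` with a clamp-to-edge sampler would play in a real image kernel.

```c
#include <assert.h>
#include <stddef.h>

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Stand-in for an image read with clamp-to-edge addressing. */
static float sample(const float *img, int w, int h, int x, int y) {
    x = clampi(x, 0, w - 1);
    y = clampi(y, 0, h - 1);
    return img[(size_t)y * (size_t)w + (size_t)x];
}

/* One 1D blur pass over a w x h single-channel image.
   (dx, dy) = (1, 0) blurs horizontally, (0, 1) blurs vertically.
   weights holds 2 * radius + 1 filter taps. */
static void blur_pass(const float *src, float *dst, int w, int h,
                      const float *weights, int radius, int dx, int dy) {
    assert(radius >= 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                sum += weights[k + radius] *
                       sample(src, w, h, x + k * dx, y + k * dy);
            }
            dst[(size_t)y * (size_t)w + (size_t)x] = sum;
        }
    }
}
```

Calling `blur_pass` with direction `(1, 0)` and then with `(0, 1)` on the intermediate result gives the full 2D blur, with no transpose in between. Whether both directions really cost the same on a given device depends on its texture cache, as described above.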
I hope the part about tiles, texture caching, and locality of reference helps answer your question about access patterns.
Caveat: There are ways of creating an image from a buffer, but the memory layout is then linear rather than tiled, so the above goes out the window (you can expect horizontally adjacent reads to be cached, but not vertically adjacent reads).
Upvotes: 3