Order of execution in CUDA or OpenCL kernels - for memory access optimisation

Question

Is there any guidance on the order of execution within a kernel?

Let's say I start processing a 1024x1024 grid with work groups of 8x8, and I have a GTX 1080 with 20 compute units of 128 cores each, for a total of 2560 cores.

Now it is clear that, on average, each physical core would process about 400 items in the grid. The question is: statistically, what would the order of execution on each core be? Would it be row major? Column major? Or would each core get its own "subarea" to work on?

The question is important in order to make sure that the memory access is cache friendly.

talonmies · Accepted Answer

Let's say I start processing a 1024x1024 grid with work groups of 8x8, and I have a GTX 1080 with 20 compute units of 128 cores each, for a total of 2560 cores.

That isn't really a valid way to visualize the GPU. You have 20 compute units. That's it. The "cores" are really a pair of (2 x 32)-lane vector ALU units, each with an instruction scheduler, and a shared L1 cache.

Now it is clear that, on average, each physical core would process about 400 items in the grid.

That doesn't follow, for a number of reasons. How work is distributed depends on the code you write and the execution parameters you use to run it. There is no intrinsic relationship between the size of the inputs to a kernel and the number of parallel operations which a given SM, or "core" within an SM, would perform.
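
To illustrate the point about execution parameters, here is a minimal sketch, assuming a 1D grid-stride loop over the 1024x1024 items. The kernel `countItems` and the host helper `itemsHandledByThread0` are hypothetical names made up for this example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: each thread handles items tid, tid + stride, tid + 2*stride, ...
// The per-thread workload is therefore set by the launch configuration,
// not by the number of "cores" on the device.
__global__ void countItems(int n, int *itemsPerThread)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int count  = 0;
    for (int i = tid; i < n; i += stride) {
        ++count;  // stand-in for real per-item work
    }
    itemsPerThread[tid] = count;
}

// Hypothetical helper: launches the kernel and returns how many
// items thread 0 ended up processing.
int itemsHandledByThread0(int n, int blocks, int threadsPerBlock)
{
    int total = blocks * threadsPerBlock;
    int *d = nullptr;
    cudaMalloc(&d, total * sizeof(int));
    countItems<<<blocks, threadsPerBlock>>>(n, d);
    std::vector<int> h(total);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h.data(), d, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h[0];
}
```

Called as `itemsHandledByThread0(1 << 20, 20, 128)`, thread 0 handles 410 items; with 4096 blocks of 256 threads it handles exactly one. The input size is the same in both cases, only the launch parameters differ.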

The question is: statistically, what would the order of execution on each core be?

Undefined. CUDA makes no guarantees, implied or otherwise, about execution order.

Would it be row major? Column major ..?

Still undefined.

or each core would get its own "subarea" to work on?

It is up to the programmer to define how the logical thread/block numbering scheme exposed by the programming model maps onto features of the input data or memory.
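
As a rough sketch of one such mapping, assuming a row-major 2D array and the 8x8 blocks from the question: each thread derives an (x, y) coordinate from its block and thread indices, so each block covers its own 8x8 tile. `tagOwner` and `ownerMap` are hypothetical names for this example, and error checking is omitted. Note that this only fixes which block touches which data, not when any block runs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records the linear id of the block that processed its element.
__global__ void tagOwner(int *owner, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {
        // Row-major storage: element (x, y) lives at y * width + x.
        owner[y * width + x] = blockIdx.y * gridDim.x + blockIdx.x;
    }
}

// Hypothetical helper: returns, for every element, the id of the block that wrote it.
std::vector<int> ownerMap(int width, int height)
{
    dim3 block(8, 8);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    size_t count = size_t(width) * height;
    int *d = nullptr;
    cudaMalloc(&d, count * sizeof(int));
    tagOwner<<<grid, block>>>(d, width, height);
    std::vector<int> h(count);
    cudaMemcpy(h.data(), d, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

For a 1024x1024 input this launches a 128x128 grid of blocks. Swapping the roles of `x` and `y` in the index expression would give a different, equally valid mapping; the choice belongs to the code, not to the hardware.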

The question is important in order to make sure that the memory access is cache friendly.

The GPU has a hierarchical cache design, which means that it isn't actually important in the way you are imagining. There are well-documented programming guidelines for ensuring maximal memory throughput and cache utilization. They are not influenced by execution order in the way your question implies.
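
One of those guidelines is coalesced global memory access: neighbouring threads in a warp should touch neighbouring addresses. The sketch below, assuming a row-major float array, contrasts two index mappings that compute the same result. The kernel names and the `scaleOnDevice` helper are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// threadIdx.x walks along a row: the threads of a warp touch consecutive
// addresses, which is the pattern the coalescing guidelines ask for.
__global__ void scaleRowWise(float *data, int width, int height, float k)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) data[row * width + col] *= k;
}

// threadIdx.x walks down a column: the threads of a warp are `width`
// elements apart, so each warp access is spread over many memory segments.
__global__ void scaleColumnWise(float *data, int width, int height, float k)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) data[row * width + col] *= k;
}

// Hypothetical helper: fills an array with 1.0f, scales it on the device
// using the chosen mapping, and returns the result.
std::vector<float> scaleOnDevice(bool rowWise, int width, int height, float k)
{
    std::vector<float> h(size_t(width) * height, 1.0f);
    size_t bytes = h.size() * sizeof(float);
    float *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(32, 8);
    if (rowWise) {
        dim3 grid((width + 31) / 32, (height + 7) / 8);
        scaleRowWise<<<grid, block>>>(d, width, height, k);
    } else {
        dim3 grid((height + 31) / 32, (width + 7) / 8);
        scaleColumnWise<<<grid, block>>>(d, width, height, k);
    }
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

Both variants produce identical output. The difference is in the access pattern within each warp, which is fixed by the index arithmetic and holds regardless of the order in which blocks get scheduled. Profiling both on your own hardware is the reliable way to see how much it matters for a given kernel.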
