zlatanski

Reputation: 845

CUDA motivation for multi-dimensional kernel execution

The division of work that CUDA imposes into blocks is logical because it reflects the hardware: some number of execution threads run within a single execution unit, all in the same "block".

However, as I look at implementations of image-processing algorithms, it's not entirely clear why I should have a 2D grid of blocks, each block being a 2D grid of threads. Why won't 1D do? After all, the kernel usually sees the image as a linear 1D array of pixels anyway and has to compute its global index with the usual row * width + column arithmetic.
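To make the comparison concrete, here is a hedged sketch of the same per-pixel kernel written both ways (the names `img`, `width`, and `height` are assumptions, not from any particular codebase). Both end up reading and writing the same linear offset:

```cuda
#include <cuda_runtime.h>

// 1D launch: one flat index over all pixels.
__global__ void invert1D(unsigned char *img, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global linear index
    if (idx < width * height)
        img[idx] = 255 - img[idx];
}

// 2D launch: x = column, y = row; the linear offset is computed explicitly.
__global__ void invert2D(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x]; // same offset as 1D
}
```

A typical 2D launch would be something like `invert2D<<<dim3((width + 15) / 16, (height + 15) / 16), dim3(16, 16)>>>(img, width, height);` — the 2D configuration mostly saves you the div/mod needed to recover `x` and `y` from a flat index.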

One guess I have is spatial locality. We usually compute a pixel's value from the pixels around it, so a 2D grid of threads would ensure that adjacent pixels run within the same execution unit and can use shared memory, etc. Is this correct? Is there anything else I'm missing? Maybe ease of programming somehow (although that's hard to believe, since the code computes a 1D offset anyway)?

Thanks in advance

Upvotes: 1

Views: 559

Answers (1)

Jaa-c

Reputation: 5157

AFAIK the only reason for a 2D/3D grid is that it relates to the data. If you have 2D data (an image...) or 3D data (a particle system, etc.), you can make the code more readable by using appropriate block dimensions. Also, on older cards there was a 65,535 limit on the number of blocks in one dimension, so the other dimensions were used to get around it.

There should be no difference in performance whether you use a 1D block of threads or a 2D/3D block.
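One way to see why: within a block, threads are linearized in x-fastest order before being grouped into warps, so a 256-thread 1D block and a 16x16 2D block produce the same warps. A minimal sketch (kernel name is made up for illustration):

```cuda
#include <cstdio>

__global__ void showLinearization()
{
    // Threads in a block are linearized x-fastest; warps are formed from
    // consecutive linear ids. Launching this with <<<1, 256>>> or with
    // <<<1, dim3(16, 16)>>> yields the same warp membership.
    int tid = threadIdx.z * blockDim.y * blockDim.x
            + threadIdx.y * blockDim.x
            + threadIdx.x;
    int warp = tid / warpSize;
    printf("linear thread %d is in warp %d\n", tid, warp);
}
```

So the block shape is a bookkeeping convenience for the programmer; the hardware sees the same flat set of threads either way.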

Upvotes: 1
