Martin Berger

Reputation: 1128

Effect of distance between CUDA threads in block?

I have a naive question about GPU programming. (ChatGPT and Claude didn't really give me a convincing answer. Maybe I'm prompting badly.)

GPU programming languages like CUDA and OpenCL organise threads in (using Nvidia terminology) a 3D block structure, and blocks in a 3D grid. I know that this is convenient and natural for computer graphics. But I wonder whether the 'distance' (see below for a definition) between two threads in a block, or two blocks in a grid, has any effect on execution performance?

What I mean is that there is a natural distance between two threads T1, and T2 in the same block at block indices

The natural distance is the 3-dimensional Euclidean distance (but other choices are possible). Does this distance have any hardware effects? (E.g. if T1 and T2 are close, can they communicate faster?) I think the answer is negative, but I could not find a convincing explanation online.

A similar question can be asked about the distance of blocks in a grid.

Upvotes: 0

Views: 78

Answers (1)

ProjectPhysX

Reputation: 5754

You probably have something like the CPU core-to-core latency map in mind (source: https://www.anandtech.com/show/21124/amd-ryzen-threadripper-7980x-and-7970x-review/4).

Such a map presupposes that the cores can communicate with each other, which on a CPU they can. For GPUs there is no such map, because the streaming multiprocessors (SMs) on a GPU cannot communicate with each other. Communication is only possible from each SM to VRAM, and among the threads of a thread block within an SM via shared memory (L1 cache).

The thread block coordinates you specify in software do not determine where thread blocks are executed on the hardware. The GPU scheduler dynamically assigns thread blocks to free SMs, and you have no control over that from the application side. 3D Euclidean distance is purely a software concept: the thread block coordinates all get linearized under the hood, so their 3D coordinates have no meaning other than determining which data to load. At the hardware level there might be latency differences between SMs, depending on their physical position on the chip and trace length to the memory controller, but you have no control over that either.
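To illustrate the linearization for threads inside a block, here is a minimal host-side C++ sketch (no GPU required). It assumes the thread ID formula given in the CUDA Programming Guide, x + y*Dx + z*Dx*Dy, and a warp size of 32. `Dim3`, `linearId` and `warpOf` are hypothetical names made up for this example, not part of the CUDA API. It says nothing about how blocks are placed on SMs, since that is up to the scheduler.

```cpp
// Hypothetical stand-in for CUDA's dim3/uint3 so the sketch compiles without nvcc.
struct Dim3 { unsigned x, y, z; };

// Linear ID of position `idx` inside an extent `dim`, with x varying fastest:
// x + y*Dx + z*Dx*Dy (the thread ID formula from the CUDA Programming Guide).
constexpr unsigned linearId(Dim3 idx, Dim3 dim) {
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Index of the warp a thread falls into within its block.
// Assumption: warp size is 32, as on current NVIDIA GPUs.
constexpr unsigned warpOf(Dim3 threadIdx, Dim3 blockDim) {
    return linearId(threadIdx, blockDim) / 32u;
}
```

With a 32x4x1 block, threads (0,0,0) and (0,1,0) are Euclidean distance 1 apart but get linear IDs 0 and 32 and so land in different warps, while (0,0,0) and (31,0,0) are distance 31 apart and share warp 0. So what groups threads together is their linear ID, not their 3D distance.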

Upvotes: 1
