NVIDIA Architecture: CUDA threads and thread blocks

Question

This is mostly from the book "Computer Architecture: A Quantitative Approach."

The book states that threads are grouped 32 at a time and executed together in what it calls a thread block. However, it shows an example with a function call that uses 256 threads per thread block, and CUDA's documentation states that you can have a maximum of 512 threads per thread block.

The function call looks like this:

// round up so that nblocks * 256 >= n
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

Could somebody please explain how thread blocks are structured?

RayaneCTX · Accepted Answer

The question is a little unclear in my opinion. I will highlight a difference between thread warps and thread blocks that I find important in hopes that it helps answer whatever the true question is.

The number of threads per warp is defined by the hardware. On NVIDIA GPUs a warp is 32 threads wide, which is often explained by the SIMD unit having 32 lanes of execution, each with its own ALU. As far as I know this is not always the case: some architectures have only 16 lanes even though warps are still 32 threads wide.
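To make the warp grouping concrete, here is a minimal sketch in which each thread of a single 256-thread block records which warp it falls into and its lane within that warp. The names `warps_per_block` and `record_warp_ids` are hypothetical and only for illustration; `warpSize` is CUDA's built-in device variable, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical host-side helper: number of warps needed for one block.
// Rounds up, because a partially filled warp still occupies a whole warp.
int warps_per_block(int threads_per_block, int warp_size) {
    return (threads_per_block + warp_size - 1) / warp_size;
}

// Each thread stores its warp index within the block and its lane index
// within that warp. This sketch assumes a single one-dimensional block.
__global__ void record_warp_ids(int *warp_id, int *lane_id) {
    int t = threadIdx.x;
    warp_id[t] = t / warpSize;  // consecutive threads share a warp
    lane_id[t] = t % warpSize;  // position inside the warp
}

int main() {
    const int threads = 256;  // user-chosen block size
    int *warp_id, *lane_id;
    cudaMallocManaged(&warp_id, threads * sizeof(int));
    cudaMallocManaged(&lane_id, threads * sizeof(int));

    record_warp_ids<<<1, threads>>>(warp_id, lane_id);
    cudaDeviceSynchronize();

    // Ask the device for its warp size instead of hard-coding 32.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("warp size reported by device: %d\n", prop.warpSize);
    printf("warps in a %d-thread block: %d\n",
           threads, warps_per_block(threads, prop.warpSize));
    printf("thread 100 -> warp %d, lane %d\n", warp_id[100], lane_id[100]);

    cudaFree(warp_id);
    cudaFree(lane_id);
    return 0;
}
```

The point of the sketch is that the block size is something you pick at launch, while the warp size is something you read back from the device.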

The size of a thread block is user-defined, although constrained by the hardware. Whatever size you choose, the hardware still executes the block's threads in 32-wide warps, so a 256-thread block runs as 8 warps. Some GPU resources, such as shared memory and synchronization, cannot be shared arbitrarily between any two threads on the GPU. However, the GPU allows threads to share a larger subset of resources if they belong to the same thread block. That is the main reason thread blocks exist.
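As a rough illustration of what "sharing within a block" buys you, the sketch below has each 256-thread block sum its slice of an array using `__shared__` memory and `__syncthreads()`, both of which only work across threads of the same block. The kernel name `block_sums`, the helper `blocks_for`, and the choice of `n = 1000` are my own assumptions for the example, not something from the book; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block size chosen by the programmer. The reduction below assumes
// it is a power of two.
#define THREADS_PER_BLOCK 256

// Hypothetical host-side helper: same rounding-up trick as (n + 255) / 256.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Each block sums its own slice of x into partial[blockIdx.x].
__global__ void block_sums(const double *x, double *partial, int n) {
    // Visible to every thread of this block, and to no other block.
    __shared__ double tile[THREADS_PER_BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? x[i] : 0.0;  // pad the last block with zeros
    __syncthreads();  // barrier across all warps of this block

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();  // reached by every thread, outside the if
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1000;
    const int nblocks = blocks_for(n, THREADS_PER_BLOCK);

    double *x, *partial;
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&partial, nblocks * sizeof(double));
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    block_sums<<<nblocks, THREADS_PER_BLOCK>>>(x, partial, n);
    cudaDeviceSynchronize();

    // Blocks cannot synchronize with each other inside this kernel,
    // so the per-block results are combined on the host.
    double total = 0.0;
    for (int b = 0; b < nblocks; ++b) total += partial[b];
    printf("%d blocks, total = %.1f\n", nblocks, total);

    cudaFree(x);
    cudaFree(partial);
    return 0;
}
```

Note that the final combination happens on the host precisely because the barrier and the shared array stop at the block boundary.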
