CUDA: determining # of thread blocks within a grid

Question

I am looking at one of the simple CUDA sample programs and have a question about how it determines the number of blocks in the grid. The relevant part of the code is:

// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

Why is blocksPerGrid equal to

(numElements + threadsPerBlock - 1) / threadsPerBlock

and not just

numElements / threadsPerBlock

?
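
To make the difference concrete, here is a minimal host-side sketch that evaluates both expressions. The function names (blocks_truncated, blocks_rounded_up, compare_block_counts) are hypothetical helpers, not part of the sample, and the sketch assumes positive int values small enough that the addition does not overflow.

```c
#include <stdio.h>

/* Plain integer division: any partial block is dropped. */
int blocks_truncated(int numElements, int threadsPerBlock)
{
    return numElements / threadsPerBlock;
}

/* The expression from the question: integer division rounded up,
   so a partial block at the end still gets counted. */
int blocks_rounded_up(int numElements, int threadsPerBlock)
{
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

/* Print both block counts and the total threads each launch would provide. */
void compare_block_counts(int numElements, int threadsPerBlock)
{
    int truncated = blocks_truncated(numElements, threadsPerBlock);
    int rounded = blocks_rounded_up(numElements, threadsPerBlock);

    printf("%d elements: truncated = %d blocks (%d threads), "
           "rounded up = %d blocks (%d threads)\n",
           numElements,
           truncated, truncated * threadsPerBlock,
           rounded, rounded * threadsPerBlock);
}
```

Taking numElements = 50000 as an illustrative value (it is not given in the question) with 256 threads per block, the truncated form gives 195 blocks, or 49920 threads, which would leave the last 80 elements without a thread. The rounded-up form gives 196 blocks, or 50176 threads. When numElements is an exact multiple of threadsPerBlock, for example 512, both forms give the same result (2 blocks). Since rounding up can launch more threads than there are elements, a kernel launched this way typically guards its work with a bounds check such as if (i < numElements).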
