user3368803

Reputation: 51

CUDA: determining # of thread blocks within a grid

I am looking at one of the simple CUDA sample programs and have a question about how it determines the number of blocks in the grid. The relevant part of the code is:

// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

Why is blocksPerGrid equal to

(numElements + threadsPerBlock - 1) / threadsPerBlock 

and not just

numElements / threadsPerBlock

?

Upvotes: 1

Views: 178

Answers (1)

Robert Crovella

Reputation: 152259

This is integer division, which truncates (rounds down):

numElements / threadsPerBlock

If numElements is not evenly divisible by threadsPerBlock, the quotient is rounded down, so the grid has fewer threads than there are elements and the last few elements are never processed. We need one extra threadblock to cover those remaining elements.
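To make the shortfall concrete, here is a minimal host-side sketch in plain C (no GPU needed). The helper name `uncovered_elements` and the value 50000 used below are hypothetical and chosen only for illustration. It assumes positive `int` inputs.

```c
/* Number of elements that get no thread when the block count is
   computed with plain (truncating) integer division.
   Assumes numElements >= 0 and threadsPerBlock > 0. */
static int uncovered_elements(int numElements, int threadsPerBlock)
{
    int blocks = numElements / threadsPerBlock;     /* rounds down */
    int threadsLaunched = blocks * threadsPerBlock; /* total threads in the grid */
    return numElements - threadsLaunched;           /* equals numElements % threadsPerBlock */
}
```

For example, with 50000 elements and 256 threads per block, the truncated division gives 195 blocks, which is 49920 threads, so `uncovered_elements(50000, 256)` returns 80. With 512 elements it returns 0, because 512 is an exact multiple of 256.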

This arithmetic:

(numElements + threadsPerBlock - 1) / threadsPerBlock 

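rounds the quotient up instead of down, so it gives us the extra threadblock exactly when there is a remainder, and no extra block when numElements is an exact multiple of threadsPerBlock.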
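The rounding-up behaviour can be checked on the host with a small sketch in plain C. The names `ceil_div` and `active_threads` are hypothetical helpers, not part of the sample. The second function simulates the grid with two loops and assumes the kernel guards its work with a bounds check of the form `i < numElements`; the kernel body is not shown in the question, so that check is an assumption here.

```c
/* Ceiling of n / d for positive ints, using the same arithmetic as the sample.
   Assumes n >= 0, d > 0, and that n + d - 1 does not overflow int. */
static int ceil_div(int n, int d)
{
    return (n + d - 1) / d;
}

/* Host-side simulation of the launch: count the threads whose global
   index passes the assumed bounds check i < numElements. */
static int active_threads(int numElements, int threadsPerBlock)
{
    int blocksPerGrid = ceil_div(numElements, threadsPerBlock);
    int active = 0;
    for (int block = 0; block < blocksPerGrid; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            /* mirrors blockDim.x * blockIdx.x + threadIdx.x in a kernel */
            int i = block * threadsPerBlock + thread;
            if (i < numElements) {
                ++active; /* this thread would process element i */
            }
        }
    }
    return active;
}
```

With 50000 elements and 256 threads per block, `ceil_div` gives 196 blocks, which is 50176 threads. The 176 surplus threads in the last block fail the bounds check, so `active_threads(50000, 256)` is exactly 50000. This is why rounding up the block count goes hand in hand with a bounds check inside the kernel.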

Upvotes: 3
