Reputation: 51
I am looking at one of the simple sample CUDA programs and have a question about how it determines the number of blocks in the grid. The relevant part of the code is:
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
Why is blocksPerGrid equal to
(numElements + threadsPerBlock - 1) / threadsPerBlock
and not just
numElements / threadsPerBlock?
Upvotes: 1
Views: 178
Reputation: 152259
This is integer division, which truncates (rounds down):
numElements / threadsPerBlock
If numElements is not evenly divisible by threadsPerBlock, the truncated result is one block short: the leftover elements have no threads assigned to them, so we need an extra threadblock to cover them.
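This arithmetic rounds the division up instead:
(numElements + threadsPerBlock - 1) / threadsPerBlock
Adding threadsPerBlock - 1 carries the numerator to at least the next multiple of threadsPerBlock only when there is a remainder, so we get the extra threadblock exactly when it is needed.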
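To make the difference concrete, here is a minimal host-side sketch of the two formulas in plain C. The function names blocks_truncated and blocks_rounded_up are hypothetical and not part of the sample, and the sketch assumes both arguments are positive and small enough that the addition does not overflow an int.

```c
/* Truncating integer division: any remainder is dropped, so the last
   partial block of elements is not counted.
   (Hypothetical helper name, for illustration only.) */
int blocks_truncated(int numElements, int threadsPerBlock) {
    return numElements / threadsPerBlock;
}

/* Ceiling division, the form used in the launch code above: adding
   threadsPerBlock - 1 first makes the truncating division round up
   whenever there is a remainder.
   Assumes both arguments are positive and the sum fits in an int. */
int blocks_rounded_up(int numElements, int threadsPerBlock) {
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}
```

Taking numElements = 50000 as an illustrative value with threadsPerBlock = 256, blocks_truncated returns 195, which covers only 195 * 256 = 49920 elements and leaves 80 uncovered, while blocks_rounded_up returns 196. When the count divides evenly (for example 512 elements), both return 2, so no unnecessary block is launched. Because the last block may then contain more threads than remaining elements, the kernel presumably uses the numElements argument it receives to skip out-of-range indices.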
Upvotes: 3