Reputation: 802
Does anyone know if CUDA will run more threads than you ask it to?
I tried to run 260 threads with a block size of 256 in a C++ application in Visual Studio, so it looks like CUDA is using 2 blocks during debugging.
I was expecting the running threads to be numbered 0 to 259, but it looks like there is a thread 260, and it blocks at a __syncthreads() call. Why? How?
Or is CUDA forced to run 32 threads (a full warp) in the second block instead of the 4 I need?
I was not, and still am not, expecting this behavior.
I have been using CUDA for a while and have not encountered this before, even when running oddball numbers of threads. This is new to me.
Edited: It blocks with 260 and 272 active threads, but not with 288. By active, I mean threads that actually do something, not the passive threads that merely exist alongside the active ones, as cleared up by Robert Crovella's answer.
Upvotes: 0
Views: 139
Reputation: 151799
In CUDA, the number of threads you "ask" for is given by the number of blocks and the number of threads per block:
kernel<<<number_of_blocks, number_of_threads_per_block>>>(...);
In the simple 1D case, the total number of threads you are asking for is the product of those two numbers.
That is how many threads your kernel launch will have access to. So if you specify 256 threads per block, then the choices you have for the total number of threads are 256, 512, 768, 1024, etc. If you specify 256 threads per block, there is no way you can "ask for" 260 threads.
If you have an if test (a "thread check") like this in your kernel code:
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx < 260) {
    // body of if statement
}
then, in the body of that if statement, only up to 260 threads should be "participating", and yes, that has implications for __syncthreads() usage in the body of that if statement.
Here is one possible method for proper __syncthreads() usage in a kernel constructed with if statements that limit thread participation.
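As a sketch of that idea (the kernel, buffer names, and block size here are hypothetical, not from the question's code): keep __syncthreads() at a point that every thread in the block reaches, and guard only the per-element work with the thread check.

```cuda
__global__ void kernel(float *data, int n)
{
    __shared__ float tile[256];
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    // Every thread in the block executes the load-or-default and the
    // barrier, including the out-of-range threads with idx >= n.
    tile[threadIdx.x] = (idx < n) ? data[idx] : 0.0f;
    __syncthreads();   // reached by all threads in the block: safe

    if (idx < n) {
        // Only in-range threads do the guarded work;
        // no __syncthreads() inside this branch.
        data[idx] = tile[threadIdx.x] * 2.0f;
    }
}
```

The point is that a barrier inside the `if (idx < n)` body would be skipped by the out-of-range threads in the last block, which is exactly the hang described in the question.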
Upvotes: 3