Blue

Reputation: 193

CUDA thread execution order

I have the following code for a CUDA program:

#include <stdio.h>

#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 4

__global__ void hello()
{
    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}


int main(int argc,char **argv)
{
    // launch the kernel
    hello<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();

    // force the printf()s to flush
    cudaDeviceSynchronize();

    return 0;
}

in which every thread will print its threadIdx.x and blockIdx.x. One possible output of this program is this:

Hello. I'm a thread 0 in block 0
Hello. I'm a thread 1 in block 0
Hello. I'm a thread 2 in block 0
Hello. I'm a thread 3 in block 0
Hello. I'm a thread 0 in block 2
Hello. I'm a thread 1 in block 2
Hello. I'm a thread 2 in block 2
Hello. I'm a thread 3 in block 2
Hello. I'm a thread 0 in block 3
Hello. I'm a thread 1 in block 3
Hello. I'm a thread 2 in block 3
Hello. I'm a thread 3 in block 3
Hello. I'm a thread 0 in block 1
Hello. I'm a thread 1 in block 1
Hello. I'm a thread 2 in block 1
Hello. I'm a thread 3 in block 1

Running the program several times, I get similar results, except that the block order is random. For example, in the output above the block order is 0, 2, 3, 1; running the program again I get 1, 2, 3, 0. This is expected. However, the thread order within every block is always 0, 1, 2, 3. Why is this happening? I thought it would be random too.

I tried to change my code to force thread 0 in every block to take longer to execute. I did it like this:

__global__ void hello()
{
    if (threadIdx.x == 0)
    {
        int k = 0;
        for (int i = 0; i < 1000000; i++)
        {
            k = k + 1;
        }
    }

    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

I would expect the thread order to be 1, 2, 3, 0. However, I got a result similar to the one shown above, where the thread order was still 0, 1, 2, 3. Why is this happening?

Upvotes: 1

Views: 1892

Answers (2)

Jeff Ames

Reputation: 1446

To answer the second part of your question: when control flow diverges at the if statement, the threads where threadIdx.x != 0 simply wait at the convergence point after the if statement. They do not go on to the printf statement until thread 0 has completed the if block.
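
A sketch of the kernel from the question with those two points marked as comments (the annotation reflects the behavior described above; the exact reconvergence point is chosen by the compiler):

__global__ void hello()
{
    if (threadIdx.x == 0)              // the warp diverges here: only thread 0 takes the branch
    {
        int k = 0;
        for (int i = 0; i < 1000000; i++)
        {
            k = k + 1;                 // threads 1..3 are inactive and wait while thread 0 loops
        }
    }
    // convergence point: the whole warp is active again before this line,
    // so no thread reaches the printf until thread 0 finishes the loop
    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}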

Upvotes: 1

Robert Crovella

Reputation: 151879

However, the thread order in every block is always 0,1,2,3. Why is this happening? I thought it would be random too

With 4 threads per block you are only launching one warp per block. A warp is the unit of execution (and scheduling, and resource assignment) in CUDA, not a thread. Currently, a warp consists of 32 threads.
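
(As an aside, a thread can find its warp and its position within that warp from threadIdx.x and the built-in warpSize variable; the kernel name below is just illustrative.)

__global__ void which_warp()
{
    int warp_id = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
    int lane_id = threadIdx.x % warpSize;   // this thread's position (lane) within its warp
    printf("block %d, warp %d, lane %d\n", blockIdx.x, warp_id, lane_id);
}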

This means that all 4 of your threads per block (since there is no conditional behavior in this case) execute in lockstep. When they reach the printf call, they all execute that call from the same line of code, in lockstep.

So the question becomes: in this situation, how does the CUDA runtime dispatch these "simultaneous" function calls? The answer is unspecified, but it is not "random". Therefore it's reasonable that the order of dispatch for operations within a warp does not change from run to run.

If you launch enough threads to create multiple warps per block, and probably also include some other code to disperse and/or "randomize" the behavior between warps, you should be able to see printf operations emanating from separate warps occurring in "random" order.
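
A minimal sketch of such an experiment, assuming 64 threads per block (two warps) and a clock64()-based spin whose length depends on the warp index (the kernel name and spin constant are illustrative):

#include <stdio.h>

#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 64    // two warps of 32 threads per block

__global__ void hello_warps()
{
    int warp_id = threadIdx.x / warpSize;

    // spin for a warp-dependent amount of time so the warps drift apart
    long long start = clock64();
    while (clock64() - start < (warp_id + 1) * 100000) { }

    printf("Hello. I'm thread %d (warp %d) in block %d\n",
           threadIdx.x, warp_id, blockIdx.x);
}

int main(void)
{
    hello_warps<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();
    cudaDeviceSynchronize();    // force the printf()s to flush
    return 0;
}

Within each warp you would still expect the thread order to be fixed; only the interleaving between warps (and blocks) should vary from run to run.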

Upvotes: 3
