Blue

Reputation: 193

CUDA thread execution order

I have the following code for a CUDA program:

#include <stdio.h>

#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 4

__global__ void hello()
{
    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}


int main(int argc,char **argv)
{
    // launch the kernel
    hello<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();

    // force the printf()s to flush
    cudaDeviceSynchronize();

    return 0;
}

in which every thread will print its threadIdx.x and blockIdx.x. One possible output of this program is this:

Hello. I'm a thread 0 in block 0
Hello. I'm a thread 1 in block 0
Hello. I'm a thread 2 in block 0
Hello. I'm a thread 3 in block 0
Hello. I'm a thread 0 in block 2
Hello. I'm a thread 1 in block 2
Hello. I'm a thread 2 in block 2
Hello. I'm a thread 3 in block 2
Hello. I'm a thread 0 in block 3
Hello. I'm a thread 1 in block 3
Hello. I'm a thread 2 in block 3
Hello. I'm a thread 3 in block 3
Hello. I'm a thread 0 in block 1
Hello. I'm a thread 1 in block 1
Hello. I'm a thread 2 in block 1
Hello. I'm a thread 3 in block 1

Running the program several times, I get similar results, except that the block order is random. For example, in the output above the block order is 0, 2, 3, 1; running the program again I get 1, 2, 3, 0. This is expected. However, the thread order within every block is always 0, 1, 2, 3. Why is this happening? I thought it would be random too.

I tried to change my code to force thread 0 in every block to take longer to execute. I did it like this:

__global__ void hello()
{
    if (threadIdx.x == 0)
    {
        int k = 0;
        for (int i = 0; i < 1000000; i++)
        {
            k = k + 1;
        }
    }

    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

I would expect the thread order to be 1, 2, 3, 0. However, I got a result similar to the one shown above, where the thread order was still 0, 1, 2, 3. Why is this happening?

Upvotes: 1

Views: 1892

Answers (2)

Jeff Ames

Reputation: 1446

To answer the second part of your question: when control flow diverges at the if statement, the threads where threadIdx.x != 0 simply wait at the convergence point after the if statement. They do not go on to the printf statement until thread 0 has completed the if block.
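
A sketch of the kernel from the question with those two points marked as comments (the annotation reflects the behavior described above; the exact reconvergence point is chosen by the compiler):

__global__ void hello()
{
    if (threadIdx.x == 0)              // the warp diverges here: only thread 0 takes the branch
    {
        int k = 0;
        for (int i = 0; i < 1000000; i++)
        {
            k = k + 1;                 // threads 1..3 are inactive and wait while thread 0 loops
        }
    }
    // convergence point: the whole warp is active again before this line,
    // so no thread reaches the printf until thread 0 finishes the loop
    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}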

Upvotes: 1

Robert Crovella

Reputation: 151879

However, the thread order in every block is always 0,1,2,3. Why is this happening? I thought it would be random too

With 4 threads per block you are only launching one warp per block. A warp is the unit of execution (and scheduling, and resource assignment) in CUDA, not a thread. Currently, a warp consists of 32 threads.
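
(As an aside, a thread can find its warp and its position within that warp from threadIdx.x and the built-in warpSize variable; the kernel name below is just illustrative.)

__global__ void which_warp()
{
    int warp_id = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
    int lane_id = threadIdx.x % warpSize;   // this thread's position (lane) within its warp
    printf("block %d, warp %d, lane %d\n", blockIdx.x, warp_id, lane_id);
}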

This means that all 4 of your threads per block (since there is no conditional behavior in this case) execute in lockstep. When they reach the printf call, they all execute that call from the same line of code, in lockstep.

So the question becomes: in this situation, how does the CUDA runtime dispatch these "simultaneous" function calls? The answer is unspecified, but it is not "random". Therefore it's reasonable that the order of dispatch for operations within a warp does not change from run to run.

If you launch enough threads to create multiple warps per block, and probably also include some other code to disperse and/or "randomize" the behavior between warps, you should be able to see printf operations emanating from separate warps occurring in "random" order.
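
A minimal sketch of such an experiment, assuming 64 threads per block (two warps) and a clock64()-based spin whose length depends on the warp index (the kernel name and spin constant are illustrative):

#include <stdio.h>

#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 64    // two warps of 32 threads per block

__global__ void hello_warps()
{
    int warp_id = threadIdx.x / warpSize;

    // spin for a warp-dependent amount of time so the warps drift apart
    long long start = clock64();
    while (clock64() - start < (warp_id + 1) * 100000) { }

    printf("Hello. I'm thread %d (warp %d) in block %d\n",
           threadIdx.x, warp_id, blockIdx.x);
}

int main(void)
{
    hello_warps<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();
    cudaDeviceSynchronize();    // force the printf()s to flush
    return 0;
}

Within each warp you would still expect the thread order to be fixed; only the interleaving between warps (and blocks) should vary from run to run.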

Upvotes: 3
