What is the exact behaviour of the CUDA execution?

Question

Suppose we launch a `__global__` function with the code below. Every thread has its own curandState generator and its own slot in an array of ints (both properly initialized), which the following code sample uses:

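    #include <cstdlib>
    #include <ctime>
    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    #define NUMTHREADS 200

    // The kernels are defined below main, so declare them first.
    __global__ void setup_cuRand(curandState * state, unsigned long seed);
    __global__ void method(curandState * state, int * result);

    int main(){

        int * result;
        curandState * randState;

        // Treat any allocation error as fatal, not only cudaErrorMemoryAllocation.
        if (cudaMalloc(&randState, NUMTHREADS * sizeof(curandState)) != cudaSuccess ||
            cudaMalloc(&result, NUMTHREADS * sizeof(int)) != cudaSuccess){
            cudaDeviceReset();
            exit(-1);
        }

        setup_cuRand <<<1, NUMTHREADS>>> (randState, unsigned(time(NULL)));
        method <<<1, NUMTHREADS>>> (randState, result);
        cudaDeviceSynchronize();   // wait for both kernels to finish

        cudaFree(result);
        cudaFree(randState);
        return 0;
    }

    // One generator per thread: same seed, a different sequence number per thread.
    __global__ void setup_cuRand(curandState * state, unsigned long seed)
    {
        int id = threadIdx.x;
        curand_init(seed, id, 0, &state[id]);
    }

    __global__ void method(curandState * state, int * result){

        curandState localState = state[threadIdx.x];
        int num = curand(&localState) % 100;

        // Data-dependent branch: threads may take different paths.
        if(num > 50)
            result[threadIdx.x] = threadIdx.x;
        else
            result[threadIdx.x] = -1;
    }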

How does this actually execute? Do the threads somehow split between the two branches and rejoin afterwards, or does it work differently? And are all 1024 threads really in execution at once? I ask because when I step through the kernel in Visual Studio 2013 with the CUDA debugger, threadIdx.x always has a value of the form n*32. Until now I thought that 1024 threads could be executed at the same time, and now I have doubts.
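To make the n*32 observation concrete, here is a minimal host-side sketch of how I assume the 200 threads of the block are grouped. It assumes 32 threads per warp, assigned in threadIdx.x order; `kWarpSize`, `WarpInfo` and `splitIntoWarps` are hypothetical names used only for illustration and are not part of the CUDA API.

```cpp
#include <cassert>
#include <vector>

// Assumption: 32 threads per warp, as on current NVIDIA GPUs
// (the device reports the real value in cudaDeviceProp::warpSize).
constexpr int kWarpSize = 32;

// Hypothetical helper type: describes one warp of a 1D thread block.
struct WarpInfo {
    int firstThread;  // threadIdx.x of lane 0 in this warp
    int activeLanes;  // how many threads of the block fall into this warp
};

// Splits a 1D block of blockSize threads into warps in threadIdx.x order.
std::vector<WarpInfo> splitIntoWarps(int blockSize) {
    assert(blockSize > 0);
    std::vector<WarpInfo> warps;
    for (int first = 0; first < blockSize; first += kWarpSize) {
        int remaining = blockSize - first;
        warps.push_back({first, remaining < kWarpSize ? remaining : kWarpSize});
    }
    return warps;
}
```

Under that assumption, `splitIntoWarps(NUMTHREADS)` with NUMTHREADS = 200 gives seven warps whose first threads are 0, 32, 64, ..., 192, and the last warp has only 8 active lanes. That would match the debugger always showing a threadIdx.x that is a multiple of 32.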
