Reputation: 321
Suppose we want to launch a __global__ function (a kernel) as in the code that follows. Every thread has its own curandState generator and its own slot in an array of ints (both properly initialized), which it uses to execute the following code sample:
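#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define NUMTHREADS 200

// The kernels are defined after main, so declare them first.
__global__ void setup_cuRand(curandState * state, unsigned long seed);
__global__ void method(curandState * state, int * result);

int main(){
    int * result;
    curandState * randState;
    // One generator state and one result slot per thread.
    if (cudaMalloc(&randState, NUMTHREADS * sizeof(curandState)) != cudaSuccess ||
        cudaMalloc(&result, NUMTHREADS * sizeof(int)) != cudaSuccess){
        cudaDeviceReset();
        exit(-1);
    }
    setup_cuRand <<<1, NUMTHREADS>>> (randState, unsigned(time(NULL)));
    method <<<1, NUMTHREADS>>> (randState, result);
    cudaDeviceSynchronize();   // wait for both kernels before leaving main
    return 0;
}

// Gives every thread its own subsequence of the same seed.
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
    int id = threadIdx.x;
    curand_init(seed, id, 0, &state[id]);
}

// Each thread draws a number in [0, 99] and takes one of two branches.
__global__ void method(curandState * state, int * result){
    curandState localState = state[threadIdx.x];
    int num = curand(&localState) % 100;
    if(num > 50)
        result[threadIdx.x] = threadIdx.x;
    else
        result[threadIdx.x] = -1;
}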
How would this execute? I mean, do the threads magically split across the two branches and re-join later, or how does it work? And are all 1024 threads in execution at once? I ask this last question because when I step through the code in Visual Studio 2013 with the CUDA debugger, threadIdx.x always has a value of the form n*32. Until now I thought that 1024 threads could be executed at the same time, and now I'm doubtful.
Upvotes: 0
Views: 83
Reputation: 2916
The test is likely to be compiled into a predicate, that is, a conditional assignment of some value to your region of memory, rather than an actual branch. Should your if be more complex, the threads of a warp that take different paths would execute those paths one after the other and "magically" re-join after the second part of the if clause. Depending on the value of the predicate for each thread of a warp, a branch might not even get visited.
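To make the per-warp behaviour visible, here is a minimal sketch, not the asker's code: the kernel name branchDemo, the takenMask array and the even/odd predicate are all made up for illustration, and it assumes CUDA 9 or later for __ballot_sync. Each warp votes on the predicate, so you can see which lanes go into the if part and which into the else part.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread evaluates a predicate, the warp records
// which of its lanes will take the "if" side, then both sides are executed.
__global__ void branchDemo(int *result, unsigned *takenMask)
{
    int tid  = threadIdx.x;
    int lane = tid % warpSize;   // position of the thread inside its warp
    int warp = tid / warpSize;   // index of the warp inside the block

    bool pred = (tid % 2 == 0);  // deterministic stand-in for "num > 50"

    // Bit i of the mask is set when lane i of this warp has pred == true.
    unsigned mask = __ballot_sync(0xffffffffu, pred);
    if (lane == 0)
        takenMask[warp] = mask;

    if (pred)
        result[tid] = tid;       // lanes with pred == true
    else
        result[tid] = -1;        // remaining lanes of the same warp
}

int main()
{
    const int threads = 64;      // two full warps (assumes warpSize == 32)
    const int warps   = 2;
    int *dResult;
    unsigned *dMask;
    cudaMalloc(&dResult, threads * sizeof(int));
    cudaMalloc(&dMask, warps * sizeof(unsigned));

    branchDemo<<<1, threads>>>(dResult, dMask);

    unsigned hMask[warps];
    cudaMemcpy(hMask, dMask, sizeof(hMask), cudaMemcpyDeviceToHost);
    for (int w = 0; w < warps; ++w)
        printf("warp %d: lanes taking the if branch = 0x%08x\n", w, hMask[w]);

    cudaFree(dResult);
    cudaFree(dMask);
    return 0;
}
```

With the even/odd predicate, every warp should report 0x55555555, i.e. half of its lanes on each side. With a random predicate like the one in the question, the mask would differ from warp to warp and from run to run.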
When a breakpoint is hit, the data is shown for one specific thread/block id. Which thread/block is followed is given by the CUDA Debug Focus setting in Nsight for Visual Studio (while debugging with CUDA, open the NSIGHT menu entry and select Windows, then CUDA Debug Focus...). By default, thread 0,0,0 is focused.
Threads are logically executed at the same time, but in practice you have fewer than 1024 CUDA cores per SM. The threads are organized into warps of 32, and the warps are dispatched to the execution units by the warp scheduler. 1024 threads make 32 warps, and the first and the last warp are not necessarily executed at precisely the same time.
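If you want the actual numbers for your own card, something along these lines should print them. It is only a sketch that assumes device 0 is the GPU you are debugging on; the fields come from cudaDeviceProp, and the block size of 1024 is taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU of interest.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device found\n");
        return 1;
    }

    const int threadsPerBlock = 1024;   // block size from the question
    // Number of warps needed to hold the block, rounded up.
    int warpsPerBlock = (threadsPerBlock + prop.warpSize - 1) / prop.warpSize;

    printf("device:                 %s\n", prop.name);
    printf("warp size:              %d\n", prop.warpSize);
    printf("multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
    printf("max threads per block:  %d\n", prop.maxThreadsPerBlock);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("a block of %d threads is split into %d warps\n",
           threadsPerBlock, warpsPerBlock);
    return 0;
}
```

Note that the properties report limits on how many threads can be resident, not how many instructions are issued in the same clock cycle.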
See the Memory Fence Functions and Synchronization Functions sections of the CUDA documentation for more details.
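As a small illustration of why those functions matter when warps do not run in lockstep, here is a sketch (the kernel reverseInBlock and the size N are invented for the example): every thread reads a value written by a thread that usually belongs to another warp, so a __syncthreads() barrier is needed between the write and the read.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define N 256   // hypothetical block size: 8 warps of 32 threads

// Reverses an array of N ints inside a single block.
__global__ void reverseInBlock(int *data)
{
    __shared__ int tile[N];
    int tid = threadIdx.x;

    tile[tid] = data[tid];
    // Barrier: no thread continues until every warp of the block has
    // written its part of tile. Without it, the read below could hit
    // a slot that another warp has not filled yet.
    __syncthreads();

    data[tid] = tile[N - 1 - tid];
}

int main()
{
    int h[N];
    for (int i = 0; i < N; ++i)
        h[i] = i;

    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, N>>>(d);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        assert(h[i] == N - 1 - i);   // element i now holds the mirrored value
    printf("block reversed correctly\n");

    cudaFree(d);
    return 0;
}
```

__syncthreads() only synchronizes the threads of one block, which is enough here because the whole array is handled by a single block.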
Upvotes: 3