How to properly apply thread synchronization in CUDA app?

Question

I have rarely used thread synchronization in my applications because I seldom needed it. I'm not an advanced C/C++ programmer, but I'm not a beginner either. I started learning CUDA C, excited by the power of today's GPUs compared to CPUs, and I realized that CUDA programming is mostly about parallel thread execution and that proper thread synchronization is sometimes necessary. In fact, I don't even know how to apply thread synchronization in C or C++ yet. The last time I used synchronization was about 2 years ago, when I was writing simple apps in Java like this:

synchronized returnType functionName(parameters)
{
    ...
}

which allows 'functionName' to be executed by only one thread at a time, that is, different threads execute this function one after another. Now coming back to CUDA C, if I have e.g. 200 threads in a block which run the code inside a while loop:

while(some_condition)
{
    ...
}

How can I make threads <0 - 99> synchronized with each other and threads <100 - 199> synchronized with each other too, but apply synchronization in such a way that threads <0 - 99> and <100 - 199> execute alternately (that is, the first 100 threads run the contents of 'while', after that the next 100 threads run the contents of 'while', and so on)?

Robert Crovella · Accepted Answer

I think you may simply need to learn more about CUDA. You may be falling into the trap of thinking that a programming paradigm you learned previously should be applied here. I'm not sure that's the case.

But to answer your question, first let me point out that thread synchronization in CUDA is only possible within a threadblock. So my comments only apply there.

The principal sync mechanism in device code is __syncthreads(). To use it roughly along the lines you describe, I could code something like this:

__syncthreads();
if (threadIdx.x < 100){
    // code in this block will only be executed by threads 0-99, all others do nothing
}
__syncthreads();
if ((threadIdx.x > 99) && (threadIdx.x < 200)){
    // code in this block will only be executed by threads 100-199, all others do nothing
}
// all threads can begin executing at this point

Note that even threads in a threadblock are not all executing in lockstep. The SM (the threadblock processing unit in a CUDA GPU) generally breaks threadblocks into groups of 32 threads called warps, and the threads within a warp are actually (more or less) executing in lockstep. However, the code I listed above still has the effect I describe, in terms of sequencing execution amongst groups of threads, if you wanted to do that for some reason.
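To connect this back to the while loop in the question, here is a minimal sketch of the same pattern placed inside a loop. The kernel name alternate_groups, the helper run_alternating, the data buffer, and the arithmetic in each phase are all hypothetical and only serve to make the ordering observable. The sketch assumes a single block of 200 threads and a loop condition that is identical for every thread in the block, so that every thread reaches every __syncthreads() call. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: two groups of 100 threads take turns inside a loop.
// `iterations` is the same for all threads, so no thread skips a barrier.
__global__ void alternate_groups(int *data, int iterations)
{
    const int tid = threadIdx.x;
    for (int i = 0; i < iterations; ++i) {
        if (tid < 100) {
            data[tid] += 1;            // phase A: threads 0-99 only
        }
        __syncthreads();               // phase A writes are now visible to the block
        if (tid >= 100 && tid < 200) {
            data[tid - 100] *= 2;      // phase B: threads 100-199 only
        }
        __syncthreads();               // phase B finishes before the next iteration
    }
}

// Hypothetical host helper: launches one block of 200 threads on a
// zero-initialized buffer of 100 ints and returns element 0.
int run_alternating(int iterations)
{
    const int n = 100;
    int host[n] = {0};
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);
    alternate_groups<<<1, 200>>>(dev, iterations);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dev);
    return host[0];
}

int main()
{
    // Each iteration maps x to 2 * (x + 1): 0 -> 2 -> 6 -> 14.
    printf("%d\n", run_alternating(3));  // prints 14
    return 0;
}
```

The result depends on phase A completing before phase B in every iteration, which is what the two barriers enforce. If the loop condition could differ between threads of the same block, some threads would skip a __syncthreads() that others wait on, and the behavior would be undefined.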
