Reputation: 344
I am encountering the following problem when running a CUDA program:
I invoke a simple kernel with a single block that has 2 threads
CUDAkernel<<<1,2>>>
Inside the kernel I do the following:
int i = threadIdx.x;
if (i == 0) {
    waitabit();
}
if (i == 1) {
    waitabit();
}
So, both kernel threads invoke the same function, waitabit(), which simply burns some clock cycles:
__device__ void waitabit() {
    clock_t start = clock();
    clock_t now;
    for (;;) {
        now = clock();
        // handle wrap-around of the 32-bit clock counter
        clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
        if (cycles >= 10000000) {
            break;
        }
    }
}
Now the problem: waitabit() delays a thread by about 0.008 seconds. I naturally assumed that the threads run in parallel, so both would be stalled in parallel for roughly 0.008 seconds and the whole kernel's delay would be roughly 0.008 seconds.
However, this is not the case. The kernel executes them serially, and the delay is 0.016 seconds, i.e. 2 * 0.008.
Is the parallelism done incorrectly?
thanks in advance!
Upvotes: 0
Views: 173
Reputation: 152173
This is a SIMT machine. A warp processes only a single instruction at any given time. When control flow diverges, the if path and the else path are handled sequentially, not in parallel. When all threads of the warp reach your first if statement, thread 0 executes the if body while all other threads do nothing. The warp then resynchronizes at the end of that if construct and resumes processing in lockstep. Next the threads hit the second if statement, and only thread 1 continues while the others wait. The warp resynchronizes again at the end of the second if construct and resumes lockstep execution.
So the net effect for your example is that the two if statements are processed sequentially. This is expected.
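Since both branches call the same function, the divergence here is avoidable: if every thread calls waitabit() unconditionally, the whole warp spins through the delay loop together, and the kernel should take roughly 0.008 seconds rather than 0.016. A minimal sketch (the kernel name is illustrative; the waitabit() body is the one from the question):

```cuda
// Busy-wait for roughly 10 million clock cycles, as in the question.
__device__ void waitabit() {
    clock_t start = clock();
    for (;;) {
        clock_t now = clock();
        // guard against wrap-around of the 32-bit clock counter
        clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
        if (cycles >= 10000000)
            break;
    }
}

// Every thread executes the same call, so there is no branch
// divergence and both threads wait in parallel within the warp.
__global__ void CUDAkernel() {
    waitabit();
}
```

The point is not the specific function but the structure: any work that all threads must perform should sit outside per-thread if statements, so the warp never has to serialize the two paths.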
Upvotes: 2