biubiuty

Reputation: 512

CUDA independent thread scheduling

Q1: The programming guide v11.6.0 states that the following code pattern is valid on Volta and later GPUs:

if (tid % warpSize < 16) {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
} else {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
}

Why so?

Suppose the if branch gets executed first. When threads 0~15 hit the __shfl_xor_sync statement, they become inactive, and threads 16~31 start executing instructions until they hit the same statement, at which point the first and second half-warps exchange val. Is my understanding correct?

If so, the programming guide also states that "if the target thread is inactive, the retrieved value is undefined" and that "threads can be inactive for a variety of reasons including ... having taken a different branch path than the branch path currently executed by the warp." Doesn't that mean both the if and else branches will get undefined values?

Q2: On GPUs with the current implementation of independent thread scheduling (Volta~Ampere), when the if branch is executed, are the inactive threads still doing a NOOP? That is, should I still think of warp execution as lockstep?

Q3: Is synchronization (such as __shfl_sync, __ballot_sync) the only cause for statement interleaving (statements A and B from the if branch interleaved with X and Y from the else branch)? I'm curious how the current ITS differs from subwarp interleaving.

Upvotes: 5

Views: 987

Answers (1)

Robert Crovella

Reputation: 151879

Q1:

Why so?

This is an exceptional case. The programming guide doesn't give a description of the behavior of __shfl_sync() detailed enough to understand this case (that I know of), although the statements it does give are correct. For a detailed behavioral description of the instruction, I suggest looking at the PTX guide:

shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.

Careful study of that statement may be sufficient for understanding. But we can unpack it a bit.

  • As already stated, this doesn't apply to compute capability less than 7.0. For those compute capabilities, all threads named in the member mask must participate in the exact same line of code/instruction, and for any warp lane's result to be valid, the source lane must be named in the member mask and must not be excluded from participation due to forced divergence at that line of code. (A minimal sketch of the usual pre-Volta pattern follows just after this list.)
  • I would describe __shfl_sync() as "exceptional" in the cc7.0+ case because it causes partial-warp execution to pause at the point of that instruction, and control/scheduling would then be given to other warp fragments. Those other warp fragments would be allowed to proceed (due to Volta ITS) until all threads named in the member mask have arrived at a __shfl_sync() statement that "matches", i.e. has the same member mask and qualifiers. Then the shuffle statement executes. Therefore, in spite of the enforced divergence at this point, the __shfl_sync() operation behaves as if the warp were sufficiently converged at that point to match the member mask.

I would describe that as "unusual" or "exceptional" behavior.
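
Returning to the pre-Volta point in the first bullet above: on those GPUs the usual way to get the same result is to perform the shuffle unconditionally, before the branch, so that all 32 lanes participate in the same instruction. Here is a minimal sketch of that pattern (my own, not taken from the programming guide):

#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float val = threadIdx.x;
    // all lanes named in the member mask execute this same instruction,
    // which is what compute capability < 7.0 requires for valid results
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    if (tid % warpSize < 16) {
        // ... use swapped in the lower-half path ...
    } else {
        // ... use swapped in the upper-half path ...
    }
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}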

If so, the programming guide also states that "if the target thread is inactive, the retrieved value is undefined" and that "threads can be inactive for a variety of reasons including ... having taken a different branch path than the branch path currently executed by the warp."

In my view, the "if the target thread is inactive, the retrieved value is undefined" statement most directly applies to compute capability less than 7.0. It also applies to compute capability 7.0+ if there is no corresponding/matching shuffle statement elsewhere that the thread scheduler can use to create an appropriate warp-wide (or member-mask-wide) shuffle op. The provided code example only gives sensible results because there is a matching op in both the if portion and the else portion. If we made the else portion an empty statement, the code would not give interesting results for any thread in the warp.

Q2:

On GPUs with the current implementation of independent thread scheduling (Volta~Ampere), when the if branch is executed, are the inactive threads still doing a NOOP? That is, should I still think of warp execution as lockstep?

If we consider the general case, I would suggest that the way to think about inactive threads is that they are inactive. You can call that a NOOP if you like. Warp execution at that point is not "lockstep" across the entire warp, because of the enforced divergence (in my view). I don't wish to argue the semantics here. If you feel that an accurate description is "lockstep execution, given that some threads are executing the instruction and some aren't", that is OK. We have now seen, however, that for the specific case of the shuffle sync ops, the Volta+ thread scheduler works around the enforced divergence, combining ops from different execution paths, to satisfy the expectations for that particular instruction.
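
If you want to observe the divergence yourself, one approach (my own sketch, not part of the original question) is to print __activemask() inside each branch. On a diverged warp you will typically see a mask covering only the lanes of that fragment, e.g. 0x0000ffff in the if body, indicating that the other 16 lanes are simply not executing those instructions. Note that __activemask() makes no convergence guarantee, so the exact value observed is not specified.

#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    if (tid % warpSize < 16) {
        // typically reports 0x0000ffff: only lanes 0-15 are active here
        printf("thread: %d, active mask: 0x%08x\n", tid, __activemask());
    } else {
        // typically reports 0xffff0000: only lanes 16-31 are active here
        printf("thread: %d, active mask: 0x%08x\n", tid, __activemask());
    }
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}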

Q3:

Is synchronization (such as __shfl_sync, __ballot_sync) the only cause for statement interleaving (statements A and B from the if branch interleaved with X and Y from the else branch)?

I don't believe so. Any time you have a conditional if-else construct that causes an intra-warp division, you have the possibility of interleaving. I define Volta+ interleaving (figure 12) as forward progress of one warp fragment, followed by forward progress of another warp fragment, perhaps with continued alternation, prior to reconvergence. This ability to alternate back and forth doesn't only apply to the sync ops. Atomics could be handled this way; that is a particular use case for the Volta ITS model, e.g. use in a producer/consumer algorithm or for intra-warp negotiation of locks, referred to as "starvation free" in the previously linked article. We could also imagine that a warp fragment could stall for any number of reasons (e.g. a data dependency, perhaps due to a load instruction) that prevent forward progress of that warp fragment "for a while". I believe the Volta ITS can handle a variety of possible latencies by alternating forward-progress scheduling from one warp fragment to another. This idea is covered in the introduction of the paper ("load-to-use"). Sorry, I won't be able to provide an extended discussion of the paper here.
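
To illustrate the lock-negotiation point, here is a sketch of my own (the lock array, its size, and the update_bucket/contend names are hypothetical, not from the linked article): lanes of the same warp can spin on a lock held by another lane of that warp, and the Volta+ scheduler can alternate forward progress between the spinning fragments and the fragment holding the lock, so the holder eventually releases it. On pre-Volta hardware this style of intra-warp spin lock can deadlock.

#define NUM_LOCKS 64                      // hypothetical number of lock slots
__device__ int locks[NUM_LOCKS];          // 0 = free, 1 = held

__device__ void update_bucket(int bucket)
{
    // lanes of the same warp may contend for the same lock;
    // forward progress here relies on Volta+ independent thread scheduling
    while (atomicCAS(&locks[bucket], 0, 1) != 0) { }
    // ... critical section: update the data protected by locks[bucket] ...
    __threadfence();                      // make critical-section writes visible
    atomicExch(&locks[bucket], 0);        // release the lock
}

__global__ void contend()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    update_bucket((i / 4) % NUM_LOCKS);   // groups of adjacent lanes share a lock
}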

EDIT: Responding to a question in the comments, paraphrased "Under what circumstances can the scheduler use a subsequent shuffle op to satisfy the needs of a warp fragment that is waiting for shuffle op completion?"

First, let's notice that the PTX description above implies some sort of synchronization. The scheduler has halted execution of the warp fragment that encounters the shuffle op, waiting for other warp fragments to participate (somehow). This is a description of synchronization.

Second, the PTX description makes allowance for exited threads.

What does all this mean? The simplest description is just that a subsequent "matching" shuffle op can/will be "found by the scheduler", if it is possible, to satisfy the shuffle op. Let's consider some examples.

Test case 1: As given in the programming guide, we see expected results:

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 0, swp: 16.000000
thread: 1, swp: 17.000000
thread: 2, swp: 18.000000
thread: 3, swp: 19.000000
thread: 4, swp: 20.000000
thread: 5, swp: 21.000000
thread: 6, swp: 22.000000
thread: 7, swp: 23.000000
thread: 8, swp: 24.000000
thread: 9, swp: 25.000000
thread: 10, swp: 26.000000
thread: 11, swp: 27.000000
thread: 12, swp: 28.000000
thread: 13, swp: 29.000000
thread: 14, swp: 30.000000
thread: 15, swp: 31.000000
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
$

Test case 2: Remove the body of the else clause. This still "works" because of the allowance for exited threads to satisfy the sync point, but the results do not match the previous case at all. None of the shuffle ops is "successful":

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 32.000000
thread: 17, swp: 32.000000
thread: 18, swp: 32.000000
thread: 19, swp: 32.000000
thread: 20, swp: 32.000000
thread: 21, swp: 32.000000
thread: 22, swp: 32.000000
thread: 23, swp: 32.000000
thread: 24, swp: 32.000000
thread: 25, swp: 32.000000
thread: 26, swp: 32.000000
thread: 27, swp: 32.000000
thread: 28, swp: 32.000000
thread: 29, swp: 32.000000
thread: 30, swp: 32.000000
thread: 31, swp: 32.000000
thread: 0, swp: 0.000000
thread: 1, swp: 0.000000
thread: 2, swp: 0.000000
thread: 3, swp: 0.000000
thread: 4, swp: 0.000000
thread: 5, swp: 0.000000
thread: 6, swp: 0.000000
thread: 7, swp: 0.000000
thread: 8, swp: 0.000000
thread: 9, swp: 0.000000
thread: 10, swp: 0.000000
thread: 11, swp: 0.000000
thread: 12, swp: 0.000000
thread: 13, swp: 0.000000
thread: 14, swp: 0.000000
thread: 15, swp: 0.000000
$

Test case 3: Starting from test case 2, introduce a barrier to prevent threads from exiting. Now we see a hang on Volta. This is because the sync point associated with the shuffle op can never be satisfied:

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    __syncwarp();
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
<hang>

Test case 4: Start with test case 2 and introduce an additional shuffle op after the conditional area. We see partially correct results in this case. The sync point for the warp fragment encountering the shuffle op in the conditional area is apparently satisfied by the remaining warp fragment encountering the shuffle op outside the conditional area. However, as we shall see, the explanation for the partially correct results is that one warp fragment does two shuffles while the other does only one. The fragment that does two shuffles (the lower one) has a second shuffle op whose sync point is satisfied due to the exited-thread condition, but whose results are "not correct" because the source lanes are not participating at that point; they have exited:

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    swapped = __shfl_xor_sync(0xffffffff, val, 16);
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
thread: 0, swp: 0.000000
thread: 1, swp: 0.000000
thread: 2, swp: 0.000000
thread: 3, swp: 0.000000
thread: 4, swp: 0.000000
thread: 5, swp: 0.000000
thread: 6, swp: 0.000000
thread: 7, swp: 0.000000
thread: 8, swp: 0.000000
thread: 9, swp: 0.000000
thread: 10, swp: 0.000000
thread: 11, swp: 0.000000
thread: 12, swp: 0.000000
thread: 13, swp: 0.000000
thread: 14, swp: 0.000000
thread: 15, swp: 0.000000
$

Test case 5: Start with test case 4 and introduce a synchronization at the end. Once again we observe a hang. The lower warp fragment, which does two shuffle ops, does not have the sync point of its second shuffle op satisfied:

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    swapped = __shfl_xor_sync(0xffffffff, val, 16);
    printf("thread: %d, swp: %f\n", tid, swapped);
    __syncwarp();
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
<hang>

The partial printout prior to the hang at this point is expected. It is left as an exercise for the reader to explain:

  • Why do we see any printout at all?
  • Why does it look the way it does (only the upper fragment, but with apparently correct shuffle results)?

Upvotes: 9
