Eth

Reputation: 113

Conditional break performance in a for loop in shaders

Having started learning about shaders, one of the first things I read is that (dynamic) conditional branching should be avoided as much as possible for performance reasons: apparently both branches are executed, and the result is then chosen based on the condition.

However, while looking at example shaders, I found this autostereogram shader on Shadertoy. At line 30, in the main function, we can see:

for(int count = 0; count < 100; count++) {
    if(uv.x < pWid)
        break;

    float d = getDepth(uv);
    //d = 1.;

    uv.x -= pWid - (d * maxStep);
}

Here, we have a conditional break in the for loop. Naively, based on the "no conditional branching" rule above, one would expect it to perform terribly, since there is one branch per loop iteration (up to 100 here). However, this is not the case: in fact, increasing the maximum loop count from 100 to any enormous number has no visible impact on performance.

We can do away with the branching, for example with this code:

for(int count = 0; count < 100; count++) {
    float d = getDepth(uv);
    //d = 1.;

    uv.x -= (pWid - (d * maxStep)) * step(0.0, uv.x-pWid);
}

But then performance is affected by a bigger loop: at 1000 or 10000 iterations, it slows down to a crawl.

(Similarly, replacing the break with a continue slows down with bigger loops, though not as much.)

So if the GPU is not running all possible conditional branches, what exactly is happening here? In which cases can I use dynamic branching without such a performance hit?

Upvotes: 1

Views: 931

Answers (1)

LJᛃ

Reputation: 8123

On modern GPUs, the total workload (in this case, the pixels that need shading) is divided into tiles/groups. Each tile is then processed by a "warp"/"wavefront"¹, a group of threads that run in lock-step: all threads within the warp processing a tile execute the same instructions, but on different data (SIMD).

Imagine a warp that processes 2x2 pixels. Three of your pixels need 10 iterations, but the fourth needs 100, so all threads run 100 iterations; the results of the superfluous 90 iterations for the first three pixels are "masked"/discarded so they don't affect the output, and the whole warp can only move on to the next pixels once all of its threads are done. However, when all threads exit after 10 iterations, the warp can bail out and move on earlier, hence the performance gain you observed.
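A rough CPU-side sketch of this cost model (the function name and the 2x2 example numbers are illustrative, not from any real driver): the warp as a whole keeps issuing the loop body until every thread has taken its break, so its cost is the slowest thread's iteration count, capped by the loop bound.

```python
# Toy model of lock-step execution within one warp.
# Each thread wants a different number of loop iterations; results of
# already-finished threads are masked off, so the warp's cost is the
# maximum over its threads, not the sum.

def warp_loop_cost(iterations_needed, max_count):
    """How many iterations the whole warp actually executes."""
    # The warp bails out as soon as all threads have hit the break,
    # so the cost is the slowest thread, capped by the loop bound.
    return min(max(iterations_needed), max_count)

# Three pixels break after 10 iterations, one after 100: the warp
# runs 100 iterations, not 10+10+10+100 and not 4*100.
print(warp_loop_cost([10, 10, 10, 100], max_count=10000))  # 100

# If every thread breaks after 10, raising the loop bound from 100
# to 10000 costs nothing, which matches what the question observed.
print(warp_loop_cost([10, 10, 10, 10], max_count=10000))   # 10
```

The branchless `step()` version, by contrast, forces every thread to run all `max_count` iterations, so its cost scales with the loop bound regardless of the data.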

You can find a (slightly) lengthier explanation of the above here.

A glimpse into the inner workings of scheduling / tiling on modern GPUs here.

¹ "warp" is NVIDIA, "wavefront" AMD lingo

Upvotes: 3
