Why is this GLSL shader so slow?

Question

I am trying to do a raytrace on a grid in a fragment shader. I have written the shader below to do this (vertex shader just draws a screenquad).

#version 150

uniform mat4 mInvProj, mInvRot;
uniform vec4 vCamPos;

varying vec4 vPosition;

int test(vec3 p)
{
    if (p.x > -4.0 && p.x < 4.0
     && p.y > -4.0 && p.y < 4.0
     && ((p.z < -4.0 && p.z > -8.0) || (p.z > 4.0 && p.z < 8.0)))
        return 1;
    return 0;
}

void main(void) {
    vec4 cOut = vec4(0, 0, 0, 0);

    vec4 vWorldSpace = mInvRot * mInvProj * vPosition;
    vec3 vRayOrg = vCamPos.xyz;
    vec3 vRayDir = normalize(vWorldSpace.xyz);

    // http://en.wikipedia.org/wiki/Xiaolin_Wu%27s_line_algorithm
    vec3 adelta = abs(vRayDir);
    int increaser;
    vec3 gradient, sgradient;
    if (adelta.x > adelta.y && adelta.x > adelta.z)
    {
        increaser = 0;
        gradient = vec3(vRayDir.x > 0.0? 1.0: -1.0, vRayDir.y / vRayDir.x, vRayDir.z / vRayDir.x);
        sgradient = vec3(0.0, gradient.y > 0.0? 1.0: -1.0, gradient.z > 0.0? 1.0: -1.0);
    }
    else if (adelta.y > adelta.x && adelta.y > adelta.z) 
    {
        increaser = 1;
        gradient = vec3(vRayDir.x / vRayDir.y, vRayDir.y > 0.0? 1.0: -1.0, vRayDir.z / vRayDir.y);
        sgradient = vec3(gradient.x > 0.0? 1.0: -1.0, 0.0, gradient.z > 0.0? 1.0: -1.0);
    }
    else 
    {
        increaser = 2;
        gradient = vec3(vRayDir.x / vRayDir.z, vRayDir.y / vRayDir.z, vRayDir.z > 0.0? 1.0: -1.0);
        sgradient = vec3(gradient.x > 0.0? 1.0: -1.0, gradient.y > 0.0? 1.0: -1.0, 0.0);
    }
    vec3 walk = vRayOrg;
    for (int i = 0; i < 64; ++i)
    {
        vec3 fwalk = floor(walk);
        if (test(fwalk) > 0)
        {
            vec3 c = abs(fwalk) / 4.0;
            cOut = vec4(c, 1.0);
            break;
        }
        vec3 nextwalk = walk + gradient;
        vec3 fnextwalk = floor(nextwalk);

        bool xChanged = fnextwalk.x != fwalk.x;
        bool yChanged = fnextwalk.y != fwalk.y;
        bool zChanged = fnextwalk.z != fwalk.z;

        if (increaser == 0)
        {
            if ((yChanged && test(fwalk + vec3(0.0, sgradient.y, 0.0)) > 0)
             || (zChanged && test(fwalk + vec3(0.0, 0.0, sgradient.z)) > 0)
             || (yChanged && zChanged && test(fwalk + vec3(0.0, sgradient.y, sgradient.z)) > 0))
                {
                    vec3 c = abs(fwalk) / 4.0;
                    cOut = vec4(c, 1.0);
                    break;
                }
        }
        else if (increaser == 1)
        {
            if ((xChanged && test(fwalk + vec3(sgradient.x, 0.0, 0.0)) > 0)
             || (zChanged && test(fwalk + vec3(0.0, 0.0, sgradient.z)) > 0)
             || (xChanged && zChanged && test(fwalk + vec3(sgradient.x, 0.0, sgradient.z)) > 0))
                {
                    vec3 c = abs(fwalk) / 4.0;
                    cOut = vec4(c, 1.0);
                    break;
                }
        }
        else
        {
            if ((xChanged && test(fwalk + vec3(sgradient.x, 0.0, 0.0)) > 0)
             || (yChanged && test(fwalk + vec3(0.0, sgradient.y, 0.0)) > 0)
             || (xChanged && yChanged && test(fwalk + vec3(sgradient.x, sgradient.y, 0.0)) > 0))
                {
                    vec3 c = abs(fwalk) / 4.0;
                    cOut = vec4(c, 1.0);
                    break;
                }
        }

        walk = nextwalk;
    }

    gl_FragColor = cOut;
}

As long as I am looking at nearby grid items (the hardcoded ones), the framerate is acceptable (400+ fps on a GeForce 680M), although lower than I would expect compared to other shaders I have written so far. But when I look at emptiness (so the loop runs all 64 iterations), the framerate is terrible (40 fps). I get around 1200 fps when looking at the grid so closely that every pixel ends up in the same nearby grid item.

I understand that running this loop for every pixel is some work, but it is still basic math, especially now that I have removed the texture lookup and use a simple test instead, so I don't understand why it slows everything down so much. My GPU has 16 cores and runs at 700+ MHz. I am rendering at 960x540, which is 518400 pixels. I would think it should be able to handle much more than this.

If I remove the antialiasing part of the above (the code that tests some extra adjacent points based on the increaser value), it is a little better (100 fps), but come on, with these calculations it shouldn't make much difference! If I split the code so that increaser is not used and the loop is instead written out separately inside each of the three cases, the framerate stays the same. If I change some ints to floats, nothing changes.

I have written much more intensive and/or complicated shaders before, so why is this one so terribly slow? Can anyone tell which calculation makes it so slow?

I am not setting unused uniforms or anything like that, and the C code does nothing more than render. It is code I have used successfully hundreds of times before.

Anyone?

scippie · Accepted Answer

The short answer is: branching and looping in shaders can be evil. But there is much more to it than that; read this topic for further information: Efficiency of branching in shaders

It comes down to this:

A graphics adapter has one or more GPUs, and a GPU has several cores. Every core is designed to run multiple threads, but (depending on the implementation) those threads can only run the exact same code in lockstep.

So if 10 threads each have to run a loop of a different length, all 10 threads will be busy for as long as the longest loop takes to run (depending on the implementation, the shorter loops might be continued further than necessary, or those threads might stall).

The same with branches: if the thread has an if, it may be possible (depending on the implementation) that both branches are executed and the result of one of them is used.
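As an illustration of how a condition can be written without an explicit if, here is a sketch of the question's test() expressed with vector comparisons. The name testInside is hypothetical, and this is only a minimal sketch: a compiler may well generate the same code for the original version, so it is not guaranteed to be any faster.

```glsl
// Hypothetical branch-free rewrite of test() from the question.
// Returns true when p lies inside either of the two hardcoded slabs.
bool testInside(vec3 p)
{
    // p.x > -4.0 && p.x < 4.0 is the same as abs(p.x) < 4.0 (likewise for y)
    bool inXY = all(lessThan(abs(p.xy), vec2(4.0)));

    // (-8.0 < p.z < -4.0) || (4.0 < p.z < 8.0) is the same as 4.0 < abs(p.z) < 8.0
    float az = abs(p.z);
    bool inZ = az > 4.0 && az < 8.0;

    return inXY && inZ;
}
```

In the loop, a call such as test(fwalk) > 0 would then become testInside(fwalk).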

So, in conclusion, if you want some calculations to drop out depending on a condition, it might be (and probably mostly is) better to do more math and multiply by a factor of 0 than to write the condition itself and branch.

For example:

(using useLighting = 0.0f or 1.0f)
return useLighting * cLightColor * cMaterialColor + (1.0 - useLighting) * cMaterialColor;

might be better than:

if (useLighting < 0.5)
  return cMaterialColor;
else
  return cLightColor * cMaterialColor;

But sometimes it might not... performance testing is the key...
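The 0-factor form above can also be written with GLSL's built-in mix(), which computes a * (1.0 - t) + b * t. A minimal sketch, assuming useLighting is exactly 0.0 or 1.0 and that the colors are vec3 (the function name shade and the parameter types are assumptions for illustration, not taken from a real shader):

```glsl
// Hypothetical helper: the name and the vec3 types are assumptions.
// useLighting is expected to be 0.0 (unlit) or 1.0 (lit).
vec3 shade(vec3 cMaterialColor, vec3 cLightColor, float useLighting)
{
    // mix(a, b, t) = a * (1.0 - t) + b * t, so the light factor is
    // vec3(1.0) when unlit and cLightColor when lit.
    vec3 lightFactor = mix(vec3(1.0), cLightColor, useLighting);
    return cMaterialColor * lightFactor;
}
```

This is the same arithmetic as the 0-factor expression, with the multiplication by cMaterialColor factored out.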
