IceCool

Reputation: 381

Weird performance drop, caused by a single for loop

I'm currently writing an OpenGL 3.1 application (with GLSL version 330) on Linux (NVIDIA 360M card, with the 313.0 nv driver) that has about 15k lines. My problem is that in one of my vertex shaders I see drastic performance drops after making minimal changes to the code that should actually be no-ops.

For example:

// With this solution my program runs at 3-5 fps
for(int i = 0; i < 4; ++i) {
  vout.shadowCoord[i] = uShadowCP[i] * w_pos;
}

// But with this it runs at 30+ fps
vout.shadowCoord[0] = uShadowCP[0] * w_pos;
vout.shadowCoord[1] = uShadowCP[1] * w_pos;
vout.shadowCoord[2] = uShadowCP[2] * w_pos;
vout.shadowCoord[3] = uShadowCP[3] * w_pos;

// This runs at 30+ fps too
vec4 shadowCoords[4];
for(int i = 0; i < 4; ++i) {
  shadowCoords[i] = uShadowCP[i] * w_pos;
}
for(int i = 0; i < 4; ++i) {
  vout.shadowCoord[i] = shadowCoords[i];
}

Or consider this:

uniform int uNumUsedShadowMaps = 4; // edit: I called this "random_uniform" in the original question

// 8 fps
for(int i = 0; i < min(uNumUsedShadowMaps, 4); ++i) {
  vout.shadowCoord[i] = vec4(1.0);
}

// 30+ fps
for(int i = 0; i < 4; ++i) {
  if(i < uNumUsedShadowMaps) {
    vout.shadowCoord[i] = vec4(1.0);
  } else {
    vout.shadowCoord[i] = vec4(0.0);
  }
}

See the entire shader code here, where this problem appeared: http://pastebin.com/LK5CNJPD

Any idea about what could cause this would be appreciated.

Upvotes: 4

Views: 1417

Answers (1)

IceCool

Reputation: 381

I finally managed to find the source of the problem, and also found a solution to it.

But before jumping right to the solution, let me paste the most minimal shader code with which I could reproduce this 'bug'.

Vertex Shader:

#version 330 

vec3 CountPosition(); // Irrelevant how it is implemented.

uniform mat4 uProjectionMatrix, uCameraMatrix;
uniform mat4 uShadowCP[4]; // one matrix per shadowCoord entry

out VertexData {
    vec3 c_pos, w_pos;
    vec4 shadowCoord[4];
} vout;

void main() {
    vout.w_pos = CountPosition();
    vout.c_pos = (uCameraMatrix * vec4(vout.w_pos, 1.0)).xyz;
    vec4 w_pos = vec4(vout.w_pos, 1.0);

    // 20 fps
    for(int i = 0; i < 4; ++i) {
        vout.shadowCoord[i] = uShadowCP[i] * w_pos;
    }

    // 50 fps
    vout.shadowCoord[0] = uShadowCP[0] * w_pos;
    vout.shadowCoord[1] = uShadowCP[1] * w_pos;
    vout.shadowCoord[2] = uShadowCP[2] * w_pos;
    vout.shadowCoord[3] = uShadowCP[3] * w_pos;

    gl_Position = uProjectionMatrix * vec4(vout.c_pos, 1.0);
}

Fragment Shader:

#version 330

in VertexData {
    vec3 c_pos, w_pos;
    vec4 shadowCoord[4];
} vin;

out vec4 frag_color;

void main() {
    frag_color = vec4(1.0);
}

The funny thing is that only a minimal modification of the vertex shader is needed to make both variants run at 50 fps. The main function should be changed to this:

void main() {
    vec4 w_pos = vec4(CountPosition(), 1.0);
    vec4 c_pos = uCameraMatrix * w_pos;

    vout.w_pos = vec3(w_pos);
    vout.c_pos = vec3(c_pos);

    // 50 fps
    for(int i = 0; i < 4; ++i) {
        vout.shadowCoord[i] = uShadowCP[i] * w_pos;
    }

    // 50 fps
    vout.shadowCoord[0] = uShadowCP[0] * w_pos;
    vout.shadowCoord[1] = uShadowCP[1] * w_pos;
    vout.shadowCoord[2] = uShadowCP[2] * w_pos;
    vout.shadowCoord[3] = uShadowCP[3] * w_pos;

    gl_Position = uProjectionMatrix * c_pos;
}

The difference is that the upper code reads from the shader's out varyings, while the lower one stores those values in temporary variables and only writes to the out varyings.

The conclusion:

Reading a shader's out varying is often used as an optimisation to get away with one less temporary variable, or at least I have seen it in many places on the internet. Despite that, reading an out varying might actually be an invalid OpenGL operation, and might put the GL into an undefined state, in which seemingly random changes in the code can trigger bad things.

The best thing about this is that the GLSL 330 specification doesn't say anything about reading from an out varying that was previously written to. Probably because it's not something I should be doing.


P.S.

Also note that the second example in the original question might look totally different, but it behaves exactly the same in this small snippet: if the out varyings are read, it gets quite slow with i < min(uNumUsedShadowMaps, 4) as the condition of the for loop. However, if the out varyings are only written, it makes no difference to the performance, and the i < min(uNumUsedShadowMaps, 4) version runs at 50 fps too.
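
For completeness, here is a rough sketch of how that second variant could look once the out varyings are written but never read. It is only an illustration put together from the snippets above, not the exact code I measured: the aPosition attribute and the body of CountPosition() are made-up placeholders so the shader is self-contained.

```glsl
#version 330

// Hypothetical stand-in for CountPosition(); the aPosition attribute is an
// assumption made only so that this sketch compiles on its own.
in vec3 aPosition;
vec3 CountPosition() { return aPosition; }

uniform mat4 uProjectionMatrix, uCameraMatrix;
uniform int uNumUsedShadowMaps = 4;

out VertexData {
    vec3 c_pos, w_pos;
    vec4 shadowCoord[4];
} vout;

void main() {
    // Intermediate results live in temporaries...
    vec4 w_pos = vec4(CountPosition(), 1.0);
    vec4 c_pos = uCameraMatrix * w_pos;

    // ...and the out block is only ever written, never read back.
    vout.w_pos = vec3(w_pos);
    vout.c_pos = vec3(c_pos);

    // Same loop condition as in the question. Entries at or beyond
    // uNumUsedShadowMaps are left unwritten here, as in the question.
    for(int i = 0; i < min(uNumUsedShadowMaps, 4); ++i) {
        vout.shadowCoord[i] = vec4(1.0);
    }

    gl_Position = uProjectionMatrix * c_pos;
}
```

The only point of the sketch is that nothing on the right-hand side of an assignment refers to vout; everything that is needed more than once goes through w_pos and c_pos.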

Upvotes: 3
