Slow compute shader, global vs local work groups?

Question

I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1:

#version 440 core

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0, RGBA8) uniform image3D voxelTexture;

void main() {

    ivec3 pos = ivec3(gl_GlobalInvocationID);
    vec4 value = imageLoad(voxelTexture, pos);
    if(value.a > 0.0) {
        value.a = 1.0;
        imageStore(voxelTexture, pos, value);
    }
}

I invoke it using the texture dimensions as work group count, size = 128:

opacityFixShader.bind();
glBindImageTexture(0, result.mID, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA8);
glDispatchCompute(size, size, size);
opacityFixShader.unbind();
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

Timing this in RenderDoc using a GTX 1080 Ti results in a whopping 3.722 ms, which seems way too long. I feel like I am not taking full advantage of compute, should I increase the local group size or something?

derhass · Accepted Answer

I feel like I am not taking full advantage of compute, should I increase the local group size or something?

Yes, definitely. An implementation-defined number of invocations inside each work group is bundled together as a warp/wavefront/subgroup (whatever you like to call it) and executed on the actual SIMD hardware units. These bundles never span work groups, so with a local size of 1x1x1 each bundle contains a single active invocation and the remaining SIMD lanes sit idle. For all practical purposes, you should make the total local size of the work group a multiple of 64; otherwise you will waste a lot of potential GPU power.

Your workload will be totally dominated by the memory accesses, so you should also think about optimizing your accesses for cache efficiency. Since you use a 3D texture, I would actually recommend using a 3D local size like 4x4x4 or 8x8x8, so that you profit from the 3D data organization your GPU most likely uses for internally storing 3D texture data.
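A minimal sketch of what that change could look like, assuming an 8x8x8 local size. The names kOpacityFixSource, groupCount and dispatchSize are hypothetical and not part of the question's code. The shader string adds a bounds check so that texture sizes that are not a multiple of 8 still work, and the host side divides the texture size by the local size (rounding up) to get the work group count:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical shader source: the question's shader with an 8x8x8 local
// size and a bounds check for sizes that are not a multiple of 8.
static const char* kOpacityFixSource = R"glsl(
#version 440 core

layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;
layout(binding = 0, rgba8) uniform image3D voxelTexture;

void main() {
    ivec3 pos = ivec3(gl_GlobalInvocationID);
    // Skip invocations that fall outside the texture.
    if (any(greaterThanEqual(pos, imageSize(voxelTexture)))) {
        return;
    }
    vec4 value = imageLoad(voxelTexture, pos);
    if (value.a > 0.0) {
        value.a = 1.0;
        imageStore(voxelTexture, pos, value);
    }
}
)glsl";

// Number of work groups needed to cover `extent` texels along one axis
// with groups of `localSize` invocations (integer division, rounded up).
constexpr std::uint32_t groupCount(std::uint32_t extent, std::uint32_t localSize) {
    return (extent + localSize - 1) / localSize;
}

// Work group counts for a cubic texture with edge length `size`.
inline std::array<std::uint32_t, 3> dispatchSize(std::uint32_t size,
                                                 std::uint32_t localSize = 8) {
    assert(localSize > 0);
    const std::uint32_t n = groupCount(size, localSize);
    return {n, n, n};
}

// Usage at the call site from the question (requires an OpenGL context):
//   const auto groups = dispatchSize(size);
//   glDispatchCompute(groups[0], groups[1], groups[2]);
```

For size = 128 this dispatches 16 x 16 x 16 = 4096 work groups of 512 invocations each, instead of 2,097,152 work groups of a single invocation. Whether 8x8x8 or 4x4x4 is faster on a given GPU is something to measure rather than assume.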

Side note:

glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

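Are you sure about that? The barrier bit has to describe how the data is accessed after the barrier, not how it was written. If you are going to sample from the texture afterwards, that is the wrong barrier; you would need GL_TEXTURE_FETCH_BARRIER_BIT instead.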
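As a rough sketch of that rule: the NextAccess enum and the barrierFor helper below are hypothetical names used only for illustration, and the fallback #defines exist only so the snippet compiles without an OpenGL header (the values are the ones from the standard OpenGL headers):

```cpp
#include <cassert>
#include <cstdint>

// Fallback definitions so the sketch compiles without an OpenGL header.
#ifndef GL_TEXTURE_FETCH_BARRIER_BIT
#define GL_TEXTURE_FETCH_BARRIER_BIT 0x00000008
#endif
#ifndef GL_SHADER_IMAGE_ACCESS_BARRIER_BIT
#define GL_SHADER_IMAGE_ACCESS_BARRIER_BIT 0x00000020
#endif

// Hypothetical enum: how the pass after the compute dispatch reads the texture.
enum class NextAccess {
    SamplerFetch,   // texture() / texelFetch() through a sampler3D
    ImageLoadStore  // imageLoad() / imageStore() through an image3D
};

// The barrier bit is chosen by how the data is read *after* the barrier,
// not by how the compute shader wrote it.
constexpr std::uint32_t barrierFor(NextAccess next) {
    return next == NextAccess::SamplerFetch
               ? GL_TEXTURE_FETCH_BARRIER_BIT
               : GL_SHADER_IMAGE_ACCESS_BARRIER_BIT;
}

// Usage after the dispatch (requires an OpenGL context):
//   glMemoryBarrier(barrierFor(NextAccess::SamplerFetch));
```

If the texture is read in both ways afterwards, the two bits can be OR-ed together in a single glMemoryBarrier call.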

Also:

I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1

Why are you doing this? This might be a typical XY problem. Spending a whole compute pass on just that might be a bad idea in the first place, and it will never make good use of the compute resources of the GPU. The operation could potentially be done in the shaders where you actually use the texture instead, and it might be practically free there, because those shaders are also very likely to be dominated by the latency of the texture accesses. Another point to consider is that you might access the texture with some texture filtering, and then still get alpha values between 0 and 1 even after your pre-process (though maybe that is exactly what you want).
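The filtering point can be illustrated with a small CPU-side sketch. It assumes plain linear interpolation between two neighbouring texels; the helper names mix, fixAlpha, preprocessThenFilter and filterThenFix are made up for this example:

```cpp
#include <cassert>

// Linear interpolation between two texel values, as a linear filter does.
constexpr float mix(float a, float b, float t) { return a + t * (b - a); }

// The operation from the question: any alpha above 0 becomes 1.
constexpr float fixAlpha(float a) { return a > 0.0f ? 1.0f : a; }

// Pre-process pass first, filtered read afterwards:
// the stored texels are binarized, then interpolated.
constexpr float preprocessThenFilter(float a0, float a1, float t) {
    return mix(fixAlpha(a0), fixAlpha(a1), t);
}

// Filtered read first, fix applied in the consuming shader:
// the original texels are interpolated, then the result is binarized.
constexpr float filterThenFix(float a0, float a1, float t) {
    return fixAlpha(mix(a0, a1, t));
}
```

With neighbouring alpha values 0.0 and 0.3 and a sample position a quarter of the way between them, preprocessThenFilter yields 0.25 while filterThenFix yields 1.0. The two approaches are therefore not interchangeable when filtering is enabled, and which result is desired depends on how the texture is used.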
