Paul Aner

Reputation: 493

Optimizing compute shaders

I have been doing a lot of different computations in compute shaders in OpenGL over the last couple of months. Some work fine, others are slow; some I could optimize somewhat, others I could not optimize at all.

I have been playing around with the simple code below (gravitational forces between n particles), just to find some general strategies for increasing performance, but absolutely nothing works:

#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) buffer bla
{
    double rIn[];
};

layout (std430, binding = 1) writeonly buffer bla2
{
    double aOut[];
};


layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;


void main()
{
    uint self = gl_GlobalInvocationID.x;

    // Position of this invocation's particle.
    dvec3 r = dvec3(rIn[self * 3 + 0], rIn[self * 3 + 1], rIn[self * 3 + 2]);
    dvec3 a = dvec3(0.0);

    // Accumulate the acceleration contributed by every other particle.
    for (uint n = 0; n < NumParticles; n++)
    {
        if (n != self)
        {
            dvec3 diff = dvec3(rIn[n * 3 + 0], rIn[n * 3 + 1], rIn[n * 3 + 2]) - r;
            double dist2 = dot(diff, diff);
            double dist3 = 1.0 / (sqrt(dist2) * dist2);  // 1 / |diff|^3
            a += diff * dist3;
        }
    }

    aOut[self * 3 + 0] = a.x;
    aOut[self * 3 + 1] = a.y;
    aOut[self * 3 + 2] = a.z;
}

I have the strong suspicion that it is the amount of memory access that slows this code down. So one thing I tried was using a shared variable as a "buffer": the first thread (gl_LocalInvocationID.x == 0) reads the first (for example) 1024 particles, all threads do their calculations, then the next 1024 are read, etc. This slowed the code down by a factor of 2-3. Another thing I tried was putting the particle coordinates in a uniform array (which only works for up to 1024 particles, and I use a lot more - so this was just to see if it made a difference), which changed absolutely nothing.

I can provide some code for the above examples, but I don't think it would be helpful.

I know there are minor improvements one could make (like using inversesqrt instead of 1.0 / sqrt, or not computing the pair n <-> m when m <-> n has already been computed...), but I would be interested in a general approach for compute shaders.
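For reference, the inversesqrt variant would look something like this inside the loop (invDist3 is just an illustrative helper, not part of my actual code):

double invDist3(dvec3 diff)
{
    double dist2 = dot(diff, diff);
    return inversesqrt(dist2) / dist2;  // same as 1.0 / (sqrt(dist2) * dist2)
}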

So can anybody give me hints on how I could improve the performance of this code? I couldn't really find anything online about improving the performance of compute shaders, so any general advice (not necessarily just for this code) would be appreciated.

Upvotes: 2

Views: 1294

Answers (1)

Nicol Bolas

Reputation: 473272

This operation as defined doesn't seem like a good one for GPU parallelism. It's very hungry in terms of memory accesses, as complete processing for one particle requires reading the data for every other particle in the system.

If you want to keep the algorithm as-is, you can implement it more optimally. As it stands, each work item does all of the processing for a particular particle in one pass, which means a huge number of memory operations.

Instead, split your particles into blocks, sized for a work group. Each work group operates on a block of source particles and a block of test particles (which may be the same block). The test particles should be loaded into shared memory, so each work group can repeatedly read test data quickly. A single work group therefore only does a portion of the tests for each block of source particles.
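To make that concrete, here is a minimal sketch of the blocked scheme, under several assumptions not in the question: the dispatch is a 2D grid of (NumParticles / 128) x (NumParticles / 128) work groups, NumParticles is a multiple of 128, and each group writes its partial result to a hypothetical temporary buffer at binding 2 (aPartial) for the second pass described below:

#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) readonly buffer ParticlesIn { double rIn[]; };
layout (std430, binding = 2) writeonly buffer PartialOut { double aPartial[]; };

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

shared dvec3 testPos[gl_WorkGroupSize.x];   // one block of test particles

void main()
{
    uint self     = gl_GlobalInvocationID.x;                 // source particle
    uint testBase = gl_WorkGroupID.y * gl_WorkGroupSize.x;   // test block start

    dvec3 r = dvec3(rIn[self * 3 + 0], rIn[self * 3 + 1], rIn[self * 3 + 2]);

    // Cooperative load: every invocation fetches one test particle, so the
    // whole block lands in shared memory in parallel.
    uint t = testBase + gl_LocalInvocationID.x;
    testPos[gl_LocalInvocationID.x] =
        dvec3(rIn[t * 3 + 0], rIn[t * 3 + 1], rIn[t * 3 + 2]);
    barrier();

    // Test this source particle only against the block in shared memory.
    dvec3 a = dvec3(0.0);
    for (uint i = 0; i < gl_WorkGroupSize.x; i++)
    {
        if (testBase + i != self)
        {
            dvec3 diff = testPos[i] - r;
            double dist2 = dot(diff, diff);
            a += diff * (inversesqrt(dist2) / dist2);
        }
    }

    // One partial acceleration per (test block, source particle) pair.
    uint slot = (gl_WorkGroupID.y * NumParticles + self) * 3;
    aPartial[slot + 0] = a.x;
    aPartial[slot + 1] = a.y;
    aPartial[slot + 2] = a.z;
}

Note the cooperative load: every invocation fetches one test particle, rather than a single thread loading the whole block serially, which is likely why the questioner's shared-memory attempt was slower.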

The big difficulty now is writing the data. Since multiple work groups may be writing the accumulated forces for the same source particles, you need some mechanism to either atomically increment the source particle data or write the data to a temporary memory buffer. A second compute shader pass can then run over the temporary buffers and combine the data in a reduction process.
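A sketch of that reduction pass, assuming the aPartial layout from the previous sketch (one partial acceleration per (test block, particle) pair, with NumBlocks equal to the y-dimension of the first dispatch):

#version 450 core

uniform uint NumParticles;
uniform uint NumBlocks;   // number of test-particle blocks from pass one

layout (std430, binding = 2) readonly buffer PartialIn { double aPartial[]; };
layout (std430, binding = 1) writeonly buffer AccelOut { double aOut[]; };

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

void main()
{
    uint p = gl_GlobalInvocationID.x;
    if (p >= NumParticles)
        return;

    // Sum the partial accelerations this particle received from every block.
    dvec3 a = dvec3(0.0);
    for (uint b = 0; b < NumBlocks; b++)
    {
        uint base = (b * NumParticles + p) * 3;
        a += dvec3(aPartial[base + 0], aPartial[base + 1], aPartial[base + 2]);
    }

    aOut[p * 3 + 0] = a.x;
    aOut[p * 3 + 1] = a.y;
    aOut[p * 3 + 2] = a.z;
}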

Upvotes: 2
