Paul Aner

Reputation: 493

Optimizing compute shaders

I have been doing a lot of different computations in compute shaders in OpenGL over the last couple of months. Some work fine, others are slow; some I could optimize somewhat, others I could not optimize at all.

I have been playing around with the simple code below (gravitational forces between n particles), just to find some general strategies for increasing performance, but absolutely nothing works:

#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) buffer bla
{
    double rIn[];
};

layout (std430, binding = 1) writeonly buffer bla2
{
    double aOut[];
};


layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;


void main()
{
    uint self = gl_GlobalInvocationID.x;

    // Position of this invocation's particle.
    dvec3 r = dvec3(rIn[self * 3 + 0], rIn[self * 3 + 1], rIn[self * 3 + 2]);
    dvec3 a = dvec3(0.0);

    // Accumulate the acceleration contributed by every other particle.
    for (uint n = 0; n < NumParticles; n++)
    {
        if (n != self)
        {
            dvec3 diff = dvec3(rIn[n * 3 + 0], rIn[n * 3 + 1], rIn[n * 3 + 2]) - r;
            double dist2 = dot(diff, diff);
            double dist3 = 1.0 / (sqrt(dist2) * dist2);  // 1 / |diff|^3
            a += diff * dist3;
        }
    }

    aOut[self * 3 + 0] = a.x;
    aOut[self * 3 + 1] = a.y;
    aOut[self * 3 + 2] = a.z;
}

I have the strong suspicion that it is the amount of memory access that slows this code down. So one thing I tried was using a shared variable as a "buffer": the first thread (gl_LocalInvocationID.x == 0) reads the first (for example) 1024 particles, all threads do their calculations, then the next 1024 are read, etc. This slowed the code down by a factor of 2-3. Another thing I tried was putting the particle coordinates in a uniform array (which only works for up to 1024 particles, and I use a lot more - so this was just to see if it made a difference), which changed absolutely nothing.

I can provide some code for the above examples, but I don't think it would be helpful.

I know there are minor improvements one could make (like using inversesqrt instead of 1.0 / sqrt, or not computing the pair n <-> m when m <-> n has already been computed...), but I would be interested in a general approach for compute shaders.
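For reference, the inversesqrt variant would look something like this inside the loop (invDist3 is just an illustrative helper, not part of my actual code):

double invDist3(dvec3 diff)
{
    double dist2 = dot(diff, diff);
    return inversesqrt(dist2) / dist2;  // same as 1.0 / (sqrt(dist2) * dist2)
}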

So can anybody give me hints on how I could improve the performance of this code? I couldn't really find anything online about improving the performance of compute shaders, so any general advice (not necessarily just for this code) would be appreciated.

Upvotes: 2

Views: 1294

Answers (1)

Nicol Bolas

Reputation: 473272

This operation as defined doesn't seem like a good one for GPU parallelism. It's very hungry in terms of memory accesses, as complete processing for one particle requires reading the data for every other particle in the system.

If you want to keep the algorithm as-is, you can implement it more optimally. As it stands, each work item does all of the processing for a particular particle in one pass, which means a huge number of memory operations.

Instead, split your particles into blocks, sized for a work group. Each work group operates on a block of source particles and a block of test particles (which may be the same block). The test particles should be loaded into shared memory, so each work group can repeatedly read test data quickly. A single work group therefore only does a portion of the tests for each block of source particles.
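To make that concrete, here is a minimal sketch of the blocked scheme, under several assumptions not in the question: the dispatch is a 2D grid of (NumParticles / 128) x (NumParticles / 128) work groups, NumParticles is a multiple of 128, and each group writes its partial result to a hypothetical temporary buffer at binding 2 (aPartial) for the second pass described below:

#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) readonly buffer ParticlesIn { double rIn[]; };
layout (std430, binding = 2) writeonly buffer PartialOut { double aPartial[]; };

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

shared dvec3 testPos[gl_WorkGroupSize.x];   // one block of test particles

void main()
{
    uint self     = gl_GlobalInvocationID.x;                 // source particle
    uint testBase = gl_WorkGroupID.y * gl_WorkGroupSize.x;   // test block start

    dvec3 r = dvec3(rIn[self * 3 + 0], rIn[self * 3 + 1], rIn[self * 3 + 2]);

    // Cooperative load: every invocation fetches one test particle, so the
    // whole block lands in shared memory in parallel.
    uint t = testBase + gl_LocalInvocationID.x;
    testPos[gl_LocalInvocationID.x] =
        dvec3(rIn[t * 3 + 0], rIn[t * 3 + 1], rIn[t * 3 + 2]);
    barrier();

    // Test this source particle only against the block in shared memory.
    dvec3 a = dvec3(0.0);
    for (uint i = 0; i < gl_WorkGroupSize.x; i++)
    {
        if (testBase + i != self)
        {
            dvec3 diff = testPos[i] - r;
            double dist2 = dot(diff, diff);
            a += diff * (inversesqrt(dist2) / dist2);
        }
    }

    // One partial acceleration per (test block, source particle) pair.
    uint slot = (gl_WorkGroupID.y * NumParticles + self) * 3;
    aPartial[slot + 0] = a.x;
    aPartial[slot + 1] = a.y;
    aPartial[slot + 2] = a.z;
}

Note the cooperative load: every invocation fetches one test particle, rather than a single thread loading the whole block serially, which is likely why the questioner's shared-memory attempt was slower.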

The big difficulty now is writing the data. Since multiple work groups may be writing the accumulated forces for the same source particles, you need some mechanism to either atomically increment the source particle data or write the data to a temporary memory buffer. A second compute shader pass can then run over the temporary buffers and combine the data in a reduction process.
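A sketch of that reduction pass, assuming the aPartial layout from the previous sketch (one partial acceleration per (test block, particle) pair, with NumBlocks equal to the y-dimension of the first dispatch):

#version 450 core

uniform uint NumParticles;
uniform uint NumBlocks;   // number of test-particle blocks from pass one

layout (std430, binding = 2) readonly buffer PartialIn { double aPartial[]; };
layout (std430, binding = 1) writeonly buffer AccelOut { double aOut[]; };

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

void main()
{
    uint p = gl_GlobalInvocationID.x;
    if (p >= NumParticles)
        return;

    // Sum the partial accelerations this particle received from every block.
    dvec3 a = dvec3(0.0);
    for (uint b = 0; b < NumBlocks; b++)
    {
        uint base = (b * NumParticles + p) * 3;
        a += dvec3(aPartial[base + 0], aPartial[base + 1], aPartial[base + 2]);
    }

    aOut[p * 3 + 0] = a.x;
    aOut[p * 3 + 1] = a.y;
    aOut[p * 3 + 2] = a.z;
}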

Upvotes: 2
