dan b

Reputation: 33

How are tasks divided up with compute shaders?

How does a compute shader divide up tasks?

#version 430 core
layout(local_size_x = 64) in;

layout(std430, binding = 4) buffer INFO
{
    vec2 info[];
};

void main()
{
    uint gid = gl_GlobalInvocationID.x;
    info[gid].x += 1.0;
    info[gid].y += 1.0;
    memoryBarrier();
}

In this example, by specifying local_size_x = 64, does that mean that each work group will automatically have 64 threads, and since the input is a vec2 array, it knows to just go through main with each vec2 on a separate thread?

Also, what would I do if the shader were to generate 10 vec2s for every vec2 input, and I then wanted to do something different with each of those, each on a separate thread? The initial 64 threads would branch into 640. Can this be done in this same shader, or would I have to make a second one?

Upvotes: 3

Views: 3575

Answers (1)

Nicol Bolas

Reputation: 474336

In this example, by specifying local_size_x = 64, does that mean that each work group will automatically have 64 threads, and since the input is a vec2 array, it knows to just go through main with each vec2 on a separate thread?

Yes, that's how the invocations within a work group are defined.

Also, what would I do if the shader were to generate 10 vec2s for every vec2 input, and I then wanted to do something different with each of those, each on a separate thread?

How you do that is entirely up to you. But yes, it would have to be a different shader. Compute shaders cannot create invocations. Not directly.

The purpose of having work items within a work group is to allow those local invocations to communicate with each other while computing something. If you don't have any shared variables or barrier calls, then your local size doesn't really matter from a functionality perspective (it can still affect performance).

As such, you should pick your local size based on how much work you intend to shove at a particular dispatch operation. Right now, you must process vec2s in integer multiples of 64. If many invocations of the same group are reading the same values, then you need to re-evaluate how much work a full group will do.
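
If the element count isn't a multiple of 64, a common pattern is to round the dispatch up to the next full group and have out-of-range invocations exit early. A sketch (count here is a hypothetical uniform holding the actual element count, not part of the original shader):

```glsl
uniform uint count; // hypothetical: number of vec2s actually in the buffer

void main()
{
    uint gid = gl_GlobalInvocationID.x;
    if (gid >= count)
        return; // extra invocations in the last group do nothing

    info[gid] += vec2(1.0);
}
```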

The limitation on the number of invocations within a work group is hardware-dependent, but will be no less than 1024. So you've got some room to play with.

In your new system, if you still want each work group to process 64 inputs, then obviously a work group will have to have a local size totaling 640. I would suggest a smaller granularity, like 8 inputs per group, leaving your local size at a total of 80.

Whatever size you choose, the best way to actually specify this is by using the fact that the local size has multiple dimensions. The X dimension should refer to the input index, with the Y dimension being the output index from the X's input. So the Y size would be 10, with the X size being 8 or 64 or whatever you want.

Therefore, when you go to fetch your input, the index you need is:

// Flatten the 3D work group grid into a single group index, then add
// this invocation's local X. (GLSL's dot() only takes floating-point
// vectors, so the flattening is done with integer arithmetic.)
const uint group_index = gl_WorkGroupID.x +
                         gl_WorkGroupID.y * gl_NumWorkGroups.x +
                         gl_WorkGroupID.z * gl_NumWorkGroups.x * gl_NumWorkGroups.y;
const uint input_index = gl_WorkGroupSize.x * group_index + gl_LocalInvocationID.x;

The index for the output would be:

const uint output_index = gl_WorkGroupSize.y * input_index + gl_LocalInvocationID.y;
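
Putting those indices to work, the second-pass shader could look something like this. The buffer names, the binding point 5, and the per-Y-slot scaling are all illustrative assumptions, not from the question; the point is the local_size_x = 8, local_size_y = 10 layout and the two index computations:

```glsl
#version 430 core
// 8 inputs per group, 10 outputs per input: 80 invocations per group.
layout(local_size_x = 8, local_size_y = 10) in;

layout(std430, binding = 4) buffer INPUTS  { vec2 inputs[];  };
layout(std430, binding = 5) buffer OUTPUTS { vec2 outputs[]; };

void main()
{
    // Flatten the work group grid, then add this invocation's local X.
    uint group_index = gl_WorkGroupID.x +
                       gl_WorkGroupID.y * gl_NumWorkGroups.x +
                       gl_WorkGroupID.z * gl_NumWorkGroups.x * gl_NumWorkGroups.y;
    uint input_index  = gl_WorkGroupSize.x * group_index + gl_LocalInvocationID.x;
    uint output_index = gl_WorkGroupSize.y * input_index + gl_LocalInvocationID.y;

    vec2 src = inputs[input_index];
    // Do something different per Y slot; this scaling is just a placeholder.
    outputs[output_index] = src * float(gl_LocalInvocationID.y + 1u);
}
```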

Upvotes: 3
