Compute shaders: optimal division of data into invocations (threads) and workgroups

Question

As far as I understand from the OpenGL documentation on compute shader compute spaces, I can divide the data space into local invocations (threads), which execute in parallel, and into workgroups, each containing some number of local invocations; the workgroups themselves are not executed in parallel (?) and run in arbitrary order, if I understand it correctly. My main question is: what is the best strategy for dividing the data? Should I always try to maximize the local invocation size and minimize the number of workgroups to get better parallel execution, or would another strategy be better? For example, I have 10000 elements in a data buffer (say, velocity in the x direction) and every element can be computed independently. How do I determine the best number of invocations (threads) and workgroups?

P.S. For everyone who stumbles upon this question, here is an interesting article to read, which might answer your questions https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/

Andreas · Accepted Answer

https://www.opengl.org/registry/doc/glspec45.core.pdf

Chapter 19:

A work group is a collection of shader invocations that execute the same code, potentially in parallel.

While the individual shader invocations within a work group are executed as a unit, work groups are executed completely independently and in unspecified order.

After reading these sections quite a few times, I find the "best" solution is to maximize the local invocation size and minimize the number of work groups, because you then tell the driver to drop the requirement that sets of invocations be independent. Fewer requirements mean fewer rules for the platform when it translates your intent into an execution, which universally yields a better (or the same) result.

An invocation within a work group may share data with other members of the same workgroup through shared variables (see section 4.3.8 (“Shared Variables”) of the OpenGL Shading Language Specification) and issue memory and control barriers to synchronize with other members of the same work group.

Independence between invocations can be derived by the platform when compiling the shader code.
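Whichever local size is chosen, the number of work groups follows from it by a ceiling division. Here is a minimal sketch of that host-side arithmetic for the 10000-element buffer from the question. The helper name work_group_count and the list of candidate local sizes are assumptions made for illustration, not values taken from the specification. The result is what would be passed as num_groups_x to glDispatchCompute, with the local size declared in the shader via layout(local_size_x = ...).

```python
def work_group_count(num_elements: int, local_size: int) -> int:
    """Smallest number of work groups such that groups * local_size >= num_elements.

    Hypothetical helper for illustration; not part of any OpenGL binding.
    """
    if num_elements < 0 or local_size <= 0:
        raise ValueError("num_elements must be >= 0 and local_size must be > 0")
    # Ceiling division using integers only.
    return (num_elements + local_size - 1) // local_size


NUM_ELEMENTS = 10_000  # buffer size from the question

# Candidate local sizes are assumptions; the real upper bound should be
# queried from the driver (GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS).
for local_size in (32, 64, 128, 256, 1024):
    groups = work_group_count(NUM_ELEMENTS, local_size)
    # Invocations in the last group that fall past the end of the buffer.
    padding = groups * local_size - NUM_ELEMENTS
    print(f"local_size={local_size:5d}  groups={groups:4d}  idle invocations={padding}")
```

Because the total number of invocations is rounded up to a multiple of the local size, the shader needs a bounds check (for example, returning early when gl_GlobalInvocationID.x is not below the element count) so that the surplus invocations do not touch memory outside the buffer. Which of the candidate sizes actually runs fastest depends on the hardware and has to be measured.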
