Reputation: 478
The OpenGL extension GL_ARB_shader_group_vote provides a mechanism to group different shader invocations with the same value for a user-defined boolean condition, such that all invocations inside that group only need to evaluate one - the same - branch of a conditional statement. For example:
if (anyInvocationARB(condition)) {
result = do_fast_path();
} else {
result = do_general_path();
}
So there is a potential performance gain here, because the invocations can be grouped beforehand such that all do_fast_path-candidates can be executed faster than the rest. However, I could not find any information on when this mechanism is actually useful and whether it could even be harmful. Consider a shader with a dynamically uniform expression:
uniform int magicNumber;
void main() {
if (magicNumber == 1337) {
magicStuff();
} else {
return;
}
}
In this case, does it make sense to replace the condition by anyInvocationARB(magicNumber == 1337)? Since the flow is uniform, it could already be detected that only one of the two branches will ever need to be evaluated across all shader invocations. Or is this an assumption the SIMD processor must not make for some reason? I am using a lot of branching based on uniform values in my shaders, and I would like to know whether I could actually benefit from this extension or whether it could even decrease performance by inhibiting uniform flow optimizations. I have not profiled this myself (yet), so hearing about others' experiences beforehand could spare me some trouble.
Upvotes: 2
Views: 816
Reputation: 444
I am dissatisfied with the only answer, so I shall elaborate.
Simply adding "anyInvocationARB" on its own won't improve performance (Update: Yes it can, see bottom of answer).
As OP says, the GPU will already skip the branch if the condition is false for every thread in a wavefront.
So how does anyInvocationARB help improve performance?
First you need to change your algorithm. I am going to use an example.
Let's suppose you have 64 items to work on, and one threadgroup (aka wavefront aka warp) of 64x1x1 threads.
The original compute shader looks like this:
void main()
{
for( int i=0; i<64; ++i )
{
doExpensiveOperation( data[i], outResult[gl_GlobalInvocationID.x * 64u + i] );
}
}
That is, we call 64 threads, which iterate 64 times each; thus producing an output of 4096 results.
But there's a quick way to check if we should skip doing that expensive operation. So instead we optimize it:
void main()
{
for( int i=0; i<64; ++i )
{
if( needsToBeProccessed( data[i] ) )
doExpensiveOperation( data[i], outResult[gl_GlobalInvocationID.x * 64u + i] );
}
}
But here's the problem: Let's say that needsToBeProccessed returns false for all 64 work items.
The entire wavefront will perform 64 iterations and skip the expensive operation 64 times.
There is a better way to solve this: force each thread to check a single item beforehand:
bool cannotSkip = needsToBeProccessed( data[gl_LocalInvocationIndex], gl_LocalInvocationIndex );
Here, we use gl_LocalInvocationIndex instead of i. This way, each thread reads 1 work item.
Combining this change with anyInvocationARB, we end up with:
void main()
{
bool cannotSkip = needsToBeProccessed( data[gl_LocalInvocationIndex], gl_LocalInvocationIndex );
if( anyInvocationARB( cannotSkip ) )
{
for( int i=0; i<64; ++i )
{
if( needsToBeProccessed( data[i] ) )
doExpensiveOperation( data[i], outResult[gl_GlobalInvocationID.x * 64u + i] );
}
}
}
Because needsToBeProccessed returned false for all the threads, anyInvocationARB will return false and the whole loop is skipped.
In the end, each thread called needsToBeProccessed() just once instead of 64 times.
And this is how we speed up processing time.
This only works if we're more or less certain that most of the time, anyInvocationARB will return false.
If it always returns true, then we'll just end up with a slightly slower compute shader because now needsToBeProccessed will be called 65 times (not 64), and doExpensiveOperation will be called 64 times.
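For reference, here is a minimal self-contained sketch of that pattern as a complete compute shader. The buffer layouts, binding points, and the bodies of needsToBeProccessed and doExpensiveOperation are placeholders I made up for illustration (the snippets above don't define them), and the single-argument form of needsToBeProccessed is used throughout. The important additions are the #extension line, without which anyInvocationARB is not available, and the work group size declaration.

```glsl
#version 430 core
#extension GL_ARB_shader_group_vote : require

// One work group of 64x1x1 threads, as assumed in the answer.
layout(local_size_x = 64) in;

// Hypothetical buffers: 64 input items, 64 results per invocation.
layout(std430, binding = 0) readonly buffer InputData { float data[]; };
layout(std430, binding = 1) buffer OutputData { float outResult[]; };

// Placeholder for the cheap test (assumption: positive items need work).
bool needsToBeProccessed(float item)
{
    return item > 0.0;
}

// Placeholder for the expensive operation.
void doExpensiveOperation(float item, out float result)
{
    result = sqrt(item);
}

void main()
{
    // Each thread checks exactly one item.
    bool cannotSkip = needsToBeProccessed(data[gl_LocalInvocationIndex]);

    // The vote is the same for every thread in the group, so either the
    // whole group enters the loop or the whole group skips it.
    if (anyInvocationARB(cannotSkip))
    {
        for (int i = 0; i < 64; ++i)
        {
            if (needsToBeProccessed(data[i]))
                doExpensiveOperation(data[i], outResult[gl_GlobalInvocationID.x * 64u + uint(i)]);
        }
    }
}
```

Note that the vote only covers the invocations the hardware groups together, so the early-out is complete only if that group spans all 64 threads.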
Update: I realized I made a mistake at the beginning: Simply adding "anyInvocationARB" on its own CAN improve performance.
This is because without it, you're performing a dynamic branch, whereas when anyInvocationARB is used, a static branch is used. What's the difference?
Consider the following example:
void main()
{
outResult[gl_LocalInvocationIndex] = 0;
if( gl_LocalInvocationIndex == 0 )
outResult[gl_LocalInvocationIndex] = 5;
}
This is a dynamic branch.
The GPU MUST guarantee at the end of dispatch that outResult[0] == 5 and that for all other elements outResult[i] == 0.
That is, the GPU must track, via an execution mask, which threads are active in the branch and which aren't. The inactive threads in the wavefront will execute the instructions, but their result will be masked out, as if it never happened.
Now let's see what happens if we add anyInvocationARB:
void main()
{
outResult[gl_LocalInvocationIndex] = 0;
if( anyInvocationARB( gl_LocalInvocationIndex == 0 ) )
outResult[gl_LocalInvocationIndex] = 5;
}
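Now this is very interesting because the result will be GPU specific:
Let's suppose the thread group size is 64x1x1.
Now the outcome depends on how many threads the hardware votes across. If all 64 threads fall in one wavefront, every thread sees the vote as true and outResult[i] == 5 for all 64 elements. If the hardware's wavefront is narrower (32 threads, for example), only the threads that share a wavefront with thread 0 write 5, and the rest stay at 0.
But more importantly, this is a static branch and therefore the GPU does not have the overhead of dynamic branches, which require tracking inactive threads to mask out the results. Therefore simply adding anyInvocationARB() can improve performance, but please note that it can also affect the result in GPU-specific ways if you're not careful.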
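If you want to see this GPU-specific behaviour for yourself, a small probe shader along these lines should do. It is only a sketch: the buffer name and binding are made up, and it assumes a single work group is dispatched so that gl_LocalInvocationIndex can index the buffer directly.

```glsl
#version 430 core
#extension GL_ARB_shader_group_vote : require

layout(local_size_x = 64) in;

// Hypothetical output buffer with one flag per thread in the work group.
layout(std430, binding = 0) buffer VoteResult
{
    uint sawThreadZero[];
};

void main()
{
    // True for thread 0 only, so the condition diverges inside its group.
    bool isFirst = (gl_LocalInvocationIndex == 0u);

    // True for every thread that the hardware groups together with thread 0.
    bool anyFirst = anyInvocationARB(isFirst);

    sawThreadZero[gl_LocalInvocationIndex] = anyFirst ? 1u : 0u;
}
```

Reading the buffer back on the host and counting the ones gives a hint of how many threads voted together with thread 0. Treat it as a hint only: the extension leaves the grouping up to the implementation, so the count may differ between drivers or even between shader stages.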
There are cases where it does not matter, for example if you're sure that running the code on all the values will always produce the same result.
For example:
void main()
{
outResult[gl_LocalInvocationIndex] = 5;
isDirty[gl_LocalInvocationIndex] = false;
if( gl_LocalInvocationIndex == 0 )
{
outResult[0] = 67;
isDirty[0] = true;
}
if( anyInvocationARB( isDirty[gl_LocalInvocationIndex] ) )
outResult[gl_LocalInvocationIndex] = 5;
}
In this case, the nature of our code and algorithm guarantees that after the dispatch outResult[i] == 5 regardless of whether anyInvocationARB is present. And anyInvocationARB can thus be used to improve performance by using static branches instead of dynamic branches.
Of course while simply adding anyInvocationARB can indeed improve performance, the best way to make huge improvements is by taking advantage of it in the way described during the first half of this answer.
Upvotes: 2
Reputation: 26569
No, there's no point.
Read the description of the extension again:
Compute shaders operate on an explicitly specified group of threads (a local work group), but many implementations of OpenGL 4.3 will even group non-compute shader invocations and execute them in a SIMD fashion. When executing code like
if (condition) { result = do_fast_path(); } else { result = do_general_path(); }
where condition diverges between invocations, a SIMD implementation might first call do_fast_path() for the invocations where condition is true and leave the other invocations dormant. Once do_fast_path() returns, it might call do_general_path() for invocations where condition is false and leave the other invocations dormant. In this case, the shader executes both the fast and the general path and might be better off just using the general path for all invocations.
So modern GPUs don't necessarily jump; they may instead execute both sides of the if statement, enabling or disabling writes on the tasks that pass or fail the condition, except if all of the tasks chose one side of the branch.
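To make that concrete, here is a minimal sketch of a vote-guarded version of the quoted snippet, where the condition really does diverge per invocation. The buffer and the two path functions are hypothetical stand-ins; the only requirement is that the general path also gives the right answer for inputs that would have qualified for the fast path.

```glsl
#version 430 core
#extension GL_ARB_shader_group_vote : require

layout(local_size_x = 64) in;

// Hypothetical buffer; its length is assumed to match the dispatch size.
layout(std430, binding = 0) buffer Values { float values[]; };

// Stand-in fast path: only valid for non-negative inputs.
float do_fast_path(float v)
{
    return v * 2.0;
}

// Stand-in general path: valid for every input, and agrees with the
// fast path on non-negative inputs.
float do_general_path(float v)
{
    return (v >= 0.0) ? v * 2.0 : -v;
}

void main()
{
    uint idx = gl_GlobalInvocationID.x;
    float v = values[idx];

    // Per-invocation condition; it may differ between tasks in a group.
    bool condition = (v >= 0.0);

    float result;
    if (allInvocationsARB(condition)) {
        // Every task in the group qualifies, so nobody needs the general path.
        result = do_fast_path(v);
    } else {
        // At least one task does not qualify: the whole group takes the
        // general path once instead of executing both paths.
        result = do_general_path(v);
    }
    values[idx] = result;
}
```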
This implies two things:
- Using the *Invocations functions on dynamically uniform expressions is useless, since they evaluate to the same value on every task.
- Use allInvocationsARB for the fast path condition, as one of the tasks may need to go through the general path.
Upvotes: 0