Reputation: 905
C short-circuits && and ||.
Unlike the bitwise | operator, the || operator guarantees left-to-right evaluation; if the second operand is evaluated, there is a sequence point between the evaluations of the first and second operands. If the first operand compares unequal to 0, the second operand is not evaluated.
We often use this in idioms like
bool xyz = ptr && *ptr == 42;
But suppose we have very simple operands to the && operator:
int a[3] = {1, 2, 3};
[ ... other code ... ]
int myFunc() {
    int x = a[0] == 1 && a[1] == 2 && a[2] == 3;
    return x;
}
It seems that if C must guarantee that the later operands are not evaluated once an earlier one is false, it has to generate something like (forgive my freehanding of poorly-remembered assembly pseudocode):
xor eax, eax
cmp a+0x0, 0x1 ; a[0] != 1
jne end
cmp a+0x4, 0x2 ; a[1] != 2
jne end
cmp a+0x8, 0x3 ; a[2] != 3
jne end
mov al, 0x1 ; all equal, set 1
.end:
ret
That seems like a lot of branches! My general intuition is that unnecessary branches are bad, because we hit branch prediction issues. (In the past, I have replaced branches with statements like a && b, thinking it would remove the branch! Granted, I think it also makes the code clearer in most cases.)
Here's what I'd expect the non-branching version would look like, and my intuition says we'd prefer this in most cases. (again, forgive the freehand x86):
xor eax, eax
xor ecx, ecx
cmp a+0x0, 0x1 ; a[0] == 1
sete al
cmp a+0x4, 0x2 ; a[1] == 2
sete cl
and al, cl
cmp a+0x8, 0x3 ; a[2] == 3
sete cl
and al, cl
ret
This feels like it is equivalent and removes the branches. (Though maybe loading a page into cache is "side effects" enough?)
I suspect I could encourage the compiler to generate this assembly by changing my && into &? However, in general, my experience is that I should trust the compiler to optimize things better than I can.
Upvotes: 4
Views: 167
Reputation: 52548
The second example is equivalent to
x = ((a[0] ^ 1) | (a[1] ^ 2) | (a[2] ^ 3)) == 0;
because the array indices are known to be valid and the array elements are initialised. It would also be valid for uninitialised array elements on an implementation where uninitialised elements behave as if they had an unspecified value, rather than causing undefined behaviour.
So evaluation without short-circuiting would be possible, and a good compiler would do it if it were faster.
Upvotes: 0
Reputation: 214040
This feels like it is equivalent and removes the branches.
It is not. You execute the right operands of the && regardless of what the previous results were. None of the mainstream x86 compilers generates code like that, because they respect the "short circuit" behavior.
It's always a touchy subject when the compiler's optimizer changes the meaning of the code, so compilers avoid that. Rather, they seek to omit things from the C code that serve no purpose, which is also how the C standard itself defines optimization (from C17 5.1.2.3):
An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced
In your specific case the compiler could omit the right operands of && if it can conclude that no side effects are present there. However, it cannot know at compile time which operands are zero and which are non-zero, so it has to execute the code for each operand of the && until it hits a zero. And then it must stop, as required by the && operator (C17 6.5.13):
If the first operand compares equal to 0, the second operand is not evaluated.
Running your first example on a recent gcc x86 -O3 (or equivalent clang), I therefore get code similar to your first asm example:
myFunc:
xor eax, eax
cmp DWORD PTR [rdi], 1
je .L6
.L1:
ret
.L6:
cmp DWORD PTR [rdi+4], 2
jne .L1
xor eax, eax
cmp DWORD PTR [rdi+8], 3
sete al
ret
Changing the C code to int x = 0 && a[0] == 1 && a[1] == 2 && a[2] == 3;
gives the expected optimization to just return zero no matter what:
myFunc:
xor eax, eax
ret
If we modify that expression to contain a side effect somewhere in the right operands, the side effect is still not evaluated.
- Is the compiler required to generate "short-circuit" machine code if there are "no" side effects?
It is not allowed to change the meaning of the code, regardless of side effects. It cannot execute code that would otherwise be skipped. It is also not allowed to re-order the evaluations, because the && operator comes with a sequence point.
- Is there a way to let the compiler decide whether to use an early-out or a non-branching strategy?
- In practice, should this affect how I use & and && in parts of my code?
Well... maybe. && is well-known to cause extra branches, and using & is a known way to manually optimize code to reduce branches. Whether or not we should manually optimize code in the first place is the big question. The rule of thumb is that we shouldn't, unless we have spotted a real bottleneck in the program.
But let's give swapping && for & a try (and note the precedence of == over &):
int myFunc(int a[])
{
    int x = a[0] == 1 & a[1] == 2 & a[2] == 3;
    return x;
}
Does this asm look familiar? (gcc 13.3 x86 -O3)
myFunc:
cmp DWORD PTR [rdi], 1
sete dl
cmp DWORD PTR [rdi+4], 2
sete al
and edx, eax
xor eax, eax
cmp DWORD PTR [rdi+8], 3
sete al
and eax, edx
ret
It did the very optimization you suggested. This is much more efficient code.
clang did something even fancier, without any cmp at all:
myFunc: # @myFunc
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rdi + 4]
xor eax, 1
xor ecx, 2
or ecx, eax
mov edx, dword ptr [rdi + 8]
xor edx, 3
xor eax, eax
or edx, ecx
sete al
ret
Upvotes: 1
Reputation: 126253
In practice, should this affect how I use & and && in parts of my code?
No, it should not. You should write your code in the clearest, most human-readable form. Micro-optimizations like this should only be undertaken if and when profiling of the code has revealed a performance issue.
Is the compiler required to generate "short-circuit" machine code if there are "no" side effects?
The compiler is required to generate code that has the same visible effects as short-circuit code in the source. As long as it has enough information to know that is the case, it can generate any code it wants.
Some current machines treat short forward conditional branches like this as essentially "conditional skip" instructions -- that is, they execute straight through the whole block of code without taking ANY branches (or consuming branch prediction resources, which are precious), and simply suppress the instructions that should have been skipped. To take advantage of this, you might see generated code where each jne branches to the next jne instruction -- it looks quite suboptimal on paper, but turns out to be very fast when executed.
Upvotes: 7