Reputation: 9292
I have seen code like this in many answers, and the authors say this is branchless:
template <typename T>
inline T imax (T a, T b)
{
return (a > b) * a + (a <= b) * b;
}
But is this really branchless on current architectures? (x86, ARM...) And is there a real standard guarantee that this is branchless?
Upvotes: 8
Views: 2084
Reputation: 5327
I stumbled upon this SO question because I was asking me the same… turns out it’s not always. For instance, the following code…
const struct op {
const char *foo;
int bar;
int flags;
} ops[] = {
{ "foo", 5, 16 },
{ "bar", 9, 16 },
{ "baz", 13, 0 },
{ 0, 0, 0 }
};
extern int foo(const struct op *, int);
int
bar(void *a, void *b, int c, const struct op *d)
{
c |= (a == b) && (d->flags & 16);
return foo(d, c) + 1;
}
… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) in all optimisation levels. The one from 3.4.6 is more manually legibe, I’ll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s
:
[…]
.text
.p2align 2,,3
.globl bar
.type bar , @function
bar:
push %ebp
mov %ebp, %esp
push %ebx
push %eax
mov %eax, DWORD PTR [%ebp+12]
xor %ecx, %ecx
cmp DWORD PTR [%ebp+8], %eax
mov %edx, DWORD PTR [%ebp+16]
mov %ebx, DWORD PTR [%ebp+20]
je .L4
.L2:
sub %esp, 8
or %edx, %ecx
push %edx
push %ebx
call foo
inc %eax
mov %ebx, DWORD PTR [%ebp-4]
leave
ret
.p2align 2,,3
.L4:
test BYTE PTR [%ebx+8], 16
je .L2
mov %cl, 1
jmp .L2
.size bar , . - bar
Turns out the pointer comparison operation invokes a comparison and even a subroutine to insert 1.
Not even using !!(a == b)
makes a difference here.
Check the actual compiler output (assembly with -S
or disassembly with objdump -d -Mintel x.o
; drop the -Mintel
if not on x86, it merely makes the assembly more legible) of the actual compilation; compilers are unpredictable beasts.
Upvotes: 0
Reputation: 92976
x86 has the SETcc
family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.
If you use the “naïve” approach
int imax(int a, int b) {
return a > b ? a : b;
}
The compiler would generate even more efficient branch-less code using the CMOVcc
(conditional move) family of instructions.
ARM has the ability to conditionally execute every instruction which allowed the compiler to compile both your and the naïve implementation efficiently, the naïve implementation being faster.
Upvotes: 7