galinette
galinette

Reputation: 9292

Is using the result of comparison as int really branchless?

I have seen code like this in many answers, and the authors say this is branchless:

template <typename T> 
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

But is this really branchless on current architectures? (x86, ARM...) And is there a real standard guarantee that this is branchless?

Upvotes: 8

Views: 2084

Answers (2)

mirabilos
mirabilos

Reputation: 5327

I stumbled upon this SO question because I was asking me the same… turns out it’s not always. For instance, the following code…

const struct op {
    const char *foo;
    int bar;
    int flags;
} ops[] = {
    { "foo", 5, 16 },
    { "bar", 9, 16 },
    { "baz", 13, 0 },
    { 0, 0, 0 }
};

extern int foo(const struct op *, int);

int
bar(void *a, void *b, int c, const struct op *d)
{
    c |= (a == b) && (d->flags & 16);
    return foo(d, c) + 1;
}

… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) in all optimisation levels. The one from 3.4.6 is more manually legibe, I’ll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s:

[…]
    .text
    .p2align 2,,3
    .globl   bar
    .type    bar , @function
bar:
    push     %ebp
    mov      %ebp, %esp
    push     %ebx
    push     %eax
    mov      %eax, DWORD PTR [%ebp+12]
    xor      %ecx, %ecx
    cmp      DWORD PTR [%ebp+8], %eax
    mov      %edx, DWORD PTR [%ebp+16]
    mov      %ebx, DWORD PTR [%ebp+20]
    je       .L4
.L2:
    sub      %esp, 8
    or       %edx, %ecx
    push     %edx
    push     %ebx
    call     foo
    inc      %eax
    mov      %ebx, DWORD PTR [%ebp-4]
    leave
    ret
    .p2align 2,,3
.L4:
    test     BYTE PTR [%ebx+8], 16
    je       .L2
    mov      %cl, 1
    jmp      .L2
    .size    bar , . - bar

Turns out the pointer comparison operation invokes a comparison and even a subroutine to insert 1.

Not even using !!(a == b) makes a difference here.

tl;dr

Check the actual compiler output (assembly with -S or disassembly with objdump -d -Mintel x.o; drop the -Mintel if not on x86, it merely makes the assembly more legible) of the actual compilation; compilers are unpredictable beasts.

Upvotes: 0

fuz
fuz

Reputation: 92976

x86 has the SETcc family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.

If you use the “naïve” approach

int imax(int a, int b) {
    return a > b ? a : b;
}

The compiler would generate even more efficient branch-less code using the CMOVcc (conditional move) family of instructions.

ARM has the ability to conditionally execute every instruction which allowed the compiler to compile both your and the naïve implementation efficiently, the naïve implementation being faster.

Upvotes: 7

Related Questions