Reputation: 89
I'm trying to make a bit of inline assembly that tests a bit at a given position of input b, and if it's a 1, it replaces b with a XOR b.
I have the error "operand size mismatch for bt".
When I compile with -O3, sometimes it looks like it works as intended, but it's completely inconsistent: sometimes it computes correctly, and sometimes it gives an error at compile time (all with -O3).
Without -O3 it always gives an error at compile time.
The overall requirement for this is to be as fast as possible, and workable on most modern-day AMD64 processors.
unsigned long ConditionAdd(const unsigned long a, unsigned long b, const long pos) {
// Intended behavior: Looks at bit in position pos in b: if that bit is a 1, then it replaces b with a xor b
asm (
"BT %[position], %[changeable] \n\t"
"JNC to_here%=\n\t"
"XOR %[unchangeable], %[changeable]\n\t"
"to_here%=: "
: [changeable] "=&r" (b)
: [position] "mr" (pos), [unchangeable] "mr" (a), "[changeable]" (b)
: "cc"
);
return b;
}
Upvotes: 1
Views: 637
Reputation: 363970
Without -O3 it always gives an error at compile time.
You gave the compiler the option of picking memory for the bt source operand. With optimization disabled, it does, so the result doesn't assemble. bt $imm, r/m or bt %reg, r/m are the only forms encodeable. (In Intel syntax like the manual, bt r/m, imm or bt r/m, reg.)
Fortunately you didn't give bt the option of picking a memory destination, because it has crazy-CISC bitstring semantics that make it very slow in the reg, mem case (register bit-index, memory destination): like 5 uops on Ryzen, 10 uops on SnB-family (https://agner.org/optimize/ and/or https://uops.info). You always want the compiler to load the operands into registers first.
You do only read a at most once, so it's reasonable to have it in memory. OTOH, clang always picks "m" if you give it the choice of "rm" or "mr". This is a known missed-optimization bug, but if you care about clang it's generally a good idea not to give the compiler that option, or it will spill a register ahead of an asm statement.
Don't forget to allow immediates for a and pos. xor takes a 32-bit sign-extended immediate so you want an "e" constraint. A 64-bit shift count can use a "J" constraint to restrict to immediates in 0..63, or you can let bt mask (aka modulo) the source operand for you even when it's an immediate. But it's an assemble-time error for GAS to use a bt immediate that doesn't fit in an imm8. So you could use that to detect compile-time-constant pos that's much too large by just using an "i" constraint. You could also imagine doing .ifeq asm macro stuff to do %[pos] & 63 when %[pos] is numeric, otherwise just %[pos].
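For example, a minimal sketch using "Jr" instead of "ir" (the function name is mine; otherwise this is the same structure as the corrected version below). With "J", a constant pos outside 0..63 simply doesn't match the immediate alternative and should fall back to a register, instead of becoming an assemble-time error:

// Hypothetical variant: "J" only accepts compile-time constants 0..63 for pos,
// "e" allows a 32-bit sign-extended immediate for a.
unsigned long ConditionXor_J(const unsigned long a, unsigned long b, const long pos) {
    asm ("BT %[position], %[changeable] \n\t"
         "JNC to_here%=\n\t"
         "XOR %[unchangeable], %[changeable]\n\t"
         "to_here%=: "
         : [changeable] "+r" (b)
         : [position] "Jr" (pos), [unchangeable] "er" (a)
         : "cc");
    return b;
}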
BTW, you might as well use a simpler [changeable] "+&r"(b) read/write constraint instead of an output + matching constraint. That's a more compact way of telling the compiler exactly the same thing.
But also, you don't need an early-clobber. bt doesn't modify any integer registers, only EFLAGS, so no registers are written before the final read of an input-only operand. If a and b are known to hold the same value, it's totally fine for the resulting asm to have xor same,same (a zeroing idiom) as the fall-through path.
unsigned long ConditionAdd(const unsigned long a, unsigned long b, const long pos) {
// if (b & (1UL<<pos)) b ^= a;
asm (
"BT %[position], %[changeable] \n\t"
"JNC to_here%=\n\t"
"XOR %[unchangeable], %[changeable]\n\t"
"to_here%=: "
: [changeable] "+r" (b)
: [position] "ir" (pos), [unchangeable] "er" (a)
: "cc" // cc clobber is already implicit on x86, but doesn't hurt
);
return b;
}
The overall requirement for this is to be as fast as possible, and workable on most modern-day AMD-64 processors.
Then you might not want a conditional branch at all. Branchy code depends on the branch predicting well. Modern branch predictors are very fancy, but if the branch isn't correlated with preceding branches or some other pattern, you're screwed. Profile with performance counters using perf for branches and branch-misses.
You might still want inline asm, though, because gcc is generally terrible at using bt to optimize stuff like a & (1ULL << (pos&63)), or btc/bts/btr for the ^= / |= / &= ~ versions of it. (clang is better sometimes, including here.)
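For reference, these are the patterns in question, written out as standalone functions (a sketch; the names are mine). clang can sometimes compile these to bt/btc/bts/btr; gcc usually won't:

#include <stdint.h>

uint64_t test_bit (uint64_t x, unsigned pos) { return x &  (1ULL << (pos & 63)); }  // bt-style test
uint64_t flip_bit (uint64_t x, unsigned pos) { return x ^  (1ULL << (pos & 63)); }  // btc
uint64_t set_bit  (uint64_t x, unsigned pos) { return x |  (1ULL << (pos & 63)); }  // bts
uint64_t clear_bit(uint64_t x, unsigned pos) { return x & ~(1ULL << (pos & 63)); }  // btr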
You might want a bt / cmov / xor to conditionally zero a tmp copy of a before XORing, because 0 is the additive / xor identity value: b ^ 0 == b. Creating a zeroed reg for cmov is probably better than creating a 0 / -1 in a reg (e.g. with bt / sbb same,same / and %tmp, %a / xor %a, %b). On Broadwell and later, and on AMD, cmov is only 1 uop. And xor-zeroing can be off the critical path for latency. Only AMD has dependency-breaking sbb same,same, and on Intel that's 2 uops if cmov is 2 uops.
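For comparison, a hand-written sketch of that bt / sbb / and / xor mask version (assuming the System V register assignment used in the later examples: a in %rdi, b in %rsi, pos in %rdx):

# sketch of the mask approach that the cmov version avoids
bt   %rdx, %rsi        # CF = bit `pos` of b
sbb  %rax, %rax        # rax = CF ? -1 : 0  (broadcast CF into a mask)
and  %rdi, %rax        # rax = CF ? a : 0
xor  %rsi, %rax        # rax = b ^ (a or 0)
ret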
Shifting the relevant bit to the top of a register to broadcast it with sar reg, 63 is another option, but then you need 63-pos in a register, I think. That's still going to be worse than cmov for branchless.
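That sar version would look something like this hand-written sketch (same register assumptions as above), which is clearly more work than cmov:

mov  $63, %ecx
sub  %edx, %ecx        # ecx = 63 - pos
mov  %rsi, %rax        # copy of b
shl  %cl, %rax         # bit `pos` is now the sign bit
sar  $63, %rax         # rax = that bit ? -1 : 0
and  %rdi, %rax        # rax = bit ? a : 0
xor  %rsi, %rax        # rax = b ^ (a or 0)
ret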
But simply using cmov to select between b and b^a makes the most sense, especially if it's ok to destroy the register that you got a in. (i.e. force the compiler to mov-copy it if it wants to keep it around.)
But did you try using pure C to let the compiler inline and optimize? Especially if constant-propagation is possible. https://gcc.gnu.org/wiki/DontUseInlineAsm Or if it's possible to auto-vectorize this with SIMD, with a, b, and pos coming from arrays?
(You can fall back to pure C when some inputs are compile-time constants using __builtin_constant_p. Except in clang7.0 and earlier, where it's evaluated before inlining so it's always false for wrapper functions.)
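A sketch of that fallback (the wrapper name is mine; the asm body is the corrected version from above):

static inline unsigned long ConditionXorWrap(const unsigned long a, unsigned long b, const long pos) {
    if (__builtin_constant_p(pos)) {     // pure C path: lets the compiler constant-propagate
        if ((b >> pos) & 1)
            b ^= a;
        return b;
    }
    asm ("BT %[position], %[changeable] \n\t"
         "JNC to_here%=\n\t"
         "XOR %[unchangeable], %[changeable]\n\t"
         "to_here%=: "
         : [changeable] "+r" (b)
         : [position] "ir" (pos), [unchangeable] "er" (a)
         : "cc");
    return b;
}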
IDK if it was intentional that you picked unsigned long, which is 32-bit in the Windows ABI, 64-bit on x86-64 System V. And 32-bit in 32-bit mode. Your asm is width-agnostic (16, 32, or 64-bit; bt doesn't have an 8-bit operand-size version). If you meant definitely 64-bit then use uint64_t.
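e.g. (hypothetical name):

#include <stdint.h>
// fixed 64-bit operand size on every ABI, unlike unsigned long:
uint64_t ConditionXor64(uint64_t a, uint64_t b, uint64_t pos);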
// you had an editing mistake in your function name: it's Xor not Add.
unsigned long ConditionXor(const unsigned long a, unsigned long b, const long pos) {
if ((b>>pos) & 1) {
b ^= a;
}
return b;
}
compiles with clang9.0 on the Godbolt compiler explorer to
# clang9.0 -O3 -Wall -march=skylake
ConditionXor(unsigned long, unsigned long, long)
xorl %eax, %eax # rax=0
btq %rdx, %rsi
cmovbq %rdi, %rax # rax = CF ? a : 0
xorq %rsi, %rax # rax = b ^ (a or 0)
retq
But GCC compiles it basically as-written, although it will optimize b & (1UL<<pos) to (b>>pos) & 1 for you. BMI2 for shrx (single-uop variable-count shift instead of 3 uops on Intel for shr %cl, %reg) helps, so I used a -march that includes it. (Haswell or newer / Ryzen or newer.)
# gcc9.2 -O3 -march=haswell
ConditionXor(unsigned long, unsigned long, long):
xorq %rsi, %rdi # a^b
movq %rsi, %rax # retval = b
shrx %rdx, %rsi, %rdx # tmp = b>>pos
andl $1, %edx # test $1, %dl might be better
cmovne %rdi, %rax # retval = (b>>pos)&1 ? a^b : b
ret
The shrx+andl is equivalent to BT, except it sets ZF instead of CF.
If you can't enable BMI2 and have to use GCC, then inline asm for bt is probably a good idea. Otherwise use clang and get near-optimal branchless asm.
I think clang could do better by pre-calculating b^a and using cmov to select between that and b, shortening the critical path by 1 cycle because the xor can happen in parallel with bt.
# hand-written probably-optimal branchless sequence. Like GCC but using BT
xor %rsi, %rdi # destroy tmp copy of a, in a "+&r"(tmp)
bt %rdx, %rsi
mov %rsi, %rax # mov to an "=r"(output)
cmovc %rdi, %rax
ret
Note the ILP: the first 3 instructions all depend purely on the inputs (bt doesn't write its destination reg). They can all execute in the first cycle after RSI becomes ready. Then all 3 inputs for CMOV are ready the next cycle (RDI, RAX, and CF). So total latency = 2 cycles from any / each of the inputs, assuming single-uop CMOV.
You can easily turn that back into inline-asm using a tmp reg, and turning the hard-coded regs back into %[name] operands. Make sure you tell the compiler that the a input is destroyed by making it read/write. You can use separate "r"(b) input and "=r"(b) outputs, letting the compiler pick different registers if it wants. You don't need a matching constraint.
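Something like this (a sketch following that description; I used a separate retval variable instead of an "=r"(b) output, and the operand names are my guess at what was meant):

unsigned long ConditionXorCmov(const unsigned long a, unsigned long b, const long pos) {
    unsigned long retval;
    unsigned long tmp = a;                  // destroyed by the asm: becomes a^b
    asm ("xor   %[b], %[tmp]\n\t"           // tmp = a^b
         "bt    %[pos], %[b]\n\t"           // CF = bit `pos` of b
         "mov   %[b], %[retval]\n\t"        // retval = b
         "cmovc %[tmp], %[retval]"          // retval = CF ? a^b : b
         : [retval] "=&r" (retval), [tmp] "+&r" (tmp)   // early-clobber: tmp is written before the inputs are done
         : [b] "r" (b), [pos] "ir" (pos)
         : "cc");
    return retval;
}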
asm only for bt
Consider using inline asm only to wrap bt with a flag output operand, "=@ccc"(cf_output), and leave the rest to the compiler. This could let you get better asm from gcc without having to put the whole thing into asm. This allows constprop and other possible optimizations for a and a^b, and gives GCC flexibility when laying out the function, e.g. it doesn't have to do everything in one small block; it can interleave with other work. (Not a big deal given OoO exec.)
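A sketch of what that could look like (variable and operand names are mine):

// asm only produces the CF result of bt; the compiler generates the rest.
unsigned long ConditionXorFlag(const unsigned long a, unsigned long b, const long pos) {
    int bit_set;                            // receives CF after the bt
    asm ("bt %[position], %[value]"
         : "=@ccc" (bit_set)                // flag output: CF = the selected bit of b
         : [value] "r" (b), [position] "ir" (pos));
    return bit_set ? a ^ b : b;             // compiler is free to branch or use cmov here
}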
Upvotes: 1