Reputation: 89
I am trying to find the leading bit of an unsigned 64-bit int. I am using BSR because my processor doesn't have the LZCNT instruction. My problem is that once the input reaches exactly 2^32, the returned leading-bit value wraps around and cycles back through the low outputs again for inputs all the way up to 2^64.
This is my code:
unsigned long int LeadingBit(unsigned long int a) {
    if(a==0)
        return 0;
    unsigned long int nlb;
    asm (
        "BSR %1, %0 \n"
        : "=r" (nlb)
        : "mr" (a)
        : "cc"
    );
    return 1<<nlb;
}
The intention of this code is to take a 64-bit integer as input and return the value of the position of the leading 1. e.g. a = 65 (1000001) returns 1000000.
Upvotes: 6
Views: 4039
Reputation: 365277
TL:DR: you should only use inline asm if you're going to optimize the whole thing and beat the compiler. Otherwise use intrinsics like __builtin_clzll or GCC-only __builtin_ia32_bsrdi.
Your actual problem was unrelated to inline asm.
1 has type int, and unlike most operators, << doesn't promote the left side to match the right side. So you're shifting an int by more than 31 bits, which is undefined behaviour. In practice you get a 32-bit operand-size shift, which you could have noticed by looking at the compiler's asm output. (Generally a good idea whenever you're using inline asm.)
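A minimal standalone illustration of the width problem (my own toy example, not the OP's code; the volatile just keeps the compiler from folding the undefined shift away at compile time):

#include <stdio.h>

int main(void) {
    volatile unsigned nlb = 40;             // e.g. a bit index from a 64-bit BSR
    unsigned long long bad  = 1 << nlb;     // undefined: 32-bit int shifted by 40
    unsigned long long good = 1ULL << nlb;  // well-defined 64-bit shift
    // On x86 the 32-bit shift masks the count to 40 & 31 = 8,
    // so this typically prints "100 10000000000".
    printf("%llx %llx\n", bad, good);
    return 0;
}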
You don't need inline asm for this; it's easy to express using GNU C builtins.
But if performance is critical, you might want to use inline asm to work around compiler missed-optimizations for current microarchitectures in the shift part, especially if you're compiling without BMI2 for efficient variable-count shifts on Intel CPUs.
Note that bsr is quite slow on AMD CPUs while lzcnt is fast on all CPUs that support it. On Intel, both are fast. But unlike bsf/tzcnt, the two instructions produce different results even for non-zero inputs, so using rep bsr to get a fast lzcnt on newer CPUs wouldn't be useful the way it sometimes is for tzcnt. (tzcnt runs as bsf on older CPUs, and lzcnt runs as bsr on older CPUs, because the encoding of lzcnt is a REP prefix in front of bsr.)
Unfortunately bsr is quite slow on Atom/Silvermont, too. For example on Goldmont Plus: 11 uops, 9 cycle latency, and 8 cycle throughput.
See my answer on Find most significant bit (left-most) that is set in a bit array for a rundown of how various compilers are dumb with 63-__builtin_clzll(x), which could optimize back to just a bsr but doesn't.
GCC specifically (not clang) has an intrinsic __builtin_ia32_bsrdi for 64-bit bsr, and both support _bit_scan_reverse for 32-bit.
#include <stdint.h>
#include <limits.h>
#include <assert.h>   // static_assert is a macro here when compiling as C11

uint64_t LeadingBitIsolate_gnuc(uint64_t a)
{
    static_assert( CHAR_BIT * sizeof(unsigned long long) == 64, "__builtin_clzll isn't 64-bit operand size");
    uint64_t bit = 63-__builtin_clzll(a); // BSR
    return a ? (1ULL << bit) : 0; // ULL is guaranteed to be at least a 64-bit type
}
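A quick sanity check (a hypothetical test harness I'm adding, not part of the answer) using the question's own a = 65 example:

#include <stdio.h>

int main(void) {
    printf("%llu\n", (unsigned long long)LeadingBitIsolate_gnuc(65));         // expect 64
    printf("%llu\n", (unsigned long long)LeadingBitIsolate_gnuc(0));          // expect 0
    printf("%llx\n", (unsigned long long)LeadingBitIsolate_gnuc(1ULL << 63)); // expect 8000000000000000
    return 0;
}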
If a is ever a compile-time constant, constant-propagation through this function works. Using inline asm always defeats that unless you use something like if(__builtin_constant_p(a)) { non-asm version } else { asm version }. (https://gcc.gnu.org/wiki/DontUseInlineAsm)
I used uint64_t for portability to x86-64 targets where unsigned long is a 32-bit type (Linux x32 (an ILP32 ABI), and MS Windows x64), and also to non-x86 targets, since this doesn't use any inline asm.
Unfortunately this compiles to pretty bad asm with gcc and clang (Godbolt)
# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_gnuc(unsigned long):
movq %rdi, %rax
testq %rdi, %rdi
je .L4
bsrq %rdi, %rax
movl $63, %ecx
xorq $63, %rax # bit = 63 - bit in the low 6 bits, 2's complement bithack
subl %eax, %ecx # bit = 63 - bit
movl $1, %eax # 1 uop
salq %cl, %rax # 3 uops
.L4:
ret
Using BSR64 = __builtin_ia32_bsrdi from my answer on Find most significant bit (left-most) that is set in a bit array, we get this from GCC (and similar using cmov from clang and ICC). ICC/MSVC provide intrinsics that return a bool for the ZF output from BSR. test/je has the same cost as je on modern x86 (they macro-fuse), but saving the test instruction in front of cmov does have value.
// See the linked answer for the ICC/MSVC part of this
#ifdef __GNUC__
  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
        // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
        // update: clang8.0 regressed here but still doesn't do __builtin_ia32_bsrdi
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

  #include <x86intrin.h>
  #define BSR32(x) _bit_scan_reverse(x)
#endif
uint64_t LeadingBitIsolate_gnuc_v2(uint64_t a)
{
    uint64_t bit = BSR64(a);
    return a ? (1ULL << bit) : 0;
}
This compiles better with GCC, at least for the BSR part. But still not for the shift part.
# gcc -O3 -mtune=skylake
LeadingBitIsolate_gnuc_v2(unsigned long):
movq %rdi, %rax
testq %rdi, %rdi
je .L9
bsrq %rdi, %rcx
movl $1, %eax
salq %cl, %rax # missed optimization: xor-zero + bts would be much better especially with tune=skylake
.L9:
ret
On the plus side here, we've optimized out most of the BSR overhead by using a BSR intrinsic. And it can still compile to LZCNT on CPUs where that's a better choice, e.g. with clang -march=znver1 (Ryzen). GCC still uses BSR, but clang uses 63-lzcnt, which will run faster on AMD.
63-__builtin_clzll is portable to non-x86 but __builtin_ia32_bsrdi isn't. (DI = double-word integer = 64-bit.)
But the compiler does still know what's happening, and can optimize this into surrounding code, and for compile-time-constant inputs.
E.g. it can branch over other things, too, if the input=0, only testing that once. And it knows that the result has only a single bit set, so in theory if you're using it as a divisor a compiler could be smart enough to AND with res-1 instead of using a slow div instruction. (Division by zero is UB so the compiler can assume that didn't happen, maybe even optimizing away the a==0 ? part of this after inlining.)
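For instance, something like this (my own sketch, not from the answer) could in principle compile to an and instead of a div, because x % p == x & (p-1) when p is a power of 2:

uint64_t mod_by_leading_bit(uint64_t x, uint64_t y)   // hypothetical helper
{
    uint64_t res = LeadingBitIsolate_gnuc(y);  // 0 or a single set bit
    // res == 0 would be division by zero (UB), so the compiler may assume res != 0
    return x % res;                            // could become x & (res - 1)
}

Whether a given compiler actually does this after inlining isn't guaranteed; the point is only that the pure-C version leaves the door open, while inline asm hides the single-bit property from the optimizer.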
Compiling this with BMI2 available for SHLX would make it efficient on modern Intel. SnB-family has slower-than-usual variable-count shifts: instead of potential stalls if the flag-result is read, it includes 2 extra uops to check if the count is zero and conditionally merge the shift's flag-result into FLAGS. (shl %cl, %reg has to leave FLAGS unmodified if cl=0. This is the kind of legacy baggage that's part of the "x86 tax" that high-performance x86 CPUs have to pay to do superscalar out-of-order execution.)
AMD apparently manages shifts as a single uop without limitations / penalties.
Thus bts into a zeroed register is a better choice for implementing 1ULL << count, especially on Intel CPUs where xor-zeroing is extra cheap, and bts reg,reg is a single uop (instead of 2 on Ryzen).
Compared to the version in @DavidWohlferd's answer (based on my comments), this saves instructions by using a as the zero in the a==0 case, instead of needing an extra zeroed register. And the comments talk about performance implications.
#include <stdint.h>
#include <limits.h>

uint64_t LeadingBitIsolate_asm(uint64_t a)
{
    uint64_t bit;
    uint64_t bts_target = 0;
    asm (
        "BSR %[a], %[bit] \n\t"    // ZF = (a==0)
        "BTS %[bit], %[res] \n\t"  // sets CF = old bit = 0 but not ZF
        // possible partial-flag stall on some CPUs for reading ZF after writing CF
        // but not SnB-family
        "CMOVZ %[a], %[res]"       // res = a (= 0) if BSR detected a==0
        : [bit] "=r" (bit), [res] "+r"(bts_target) // no early clobber: input read before either output written
        : [a] "r" (a)
        : "cc" // redundant but not harmful on x86: flags clobber is always implied
    );
    return bts_target;
}
Reading ZF after BTS writes CF is dangerous for older Intel CPUs, where it will cause a partial-flag stall. See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs.
GCC 9.2 gives us a sequence that's only 4 uops on Skylake (https://agner.org/optimize/). The xor-zeroing needs a front-end uop, but on Sandybridge-family it doesn't need a back-end execution unit (0 unfused-domain uops).
It's 5 uops on Haswell and earlier, where cmov is 2 uops / 2c latency. Since you say your CPU doesn't support lzcnt, you might have an IvyBridge or earlier (2-uop cmov), or an old AMD.
Branching might be worth it if input=0 is expected to be rare or never happen, especially if this is part of a critical path for latency, and especially on older Intel where cmov is 2 uops; see the sketch after the compiler output below.
# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_asm(unsigned long):
xorl %eax, %eax
BSR %rdi, %rdi
BTS %rdi, %rax
CMOVZ %rdi, %rax
ret
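For completeness, a branchy variant could be as simple as the following sketch (my own addition, using the BSR64 wrapper from earlier; __builtin_expect is only a hint, and the 1ULL << bit shift is still subject to the variable-count-shift discussion above):

uint64_t LeadingBitIsolate_branchy(uint64_t a)
{
    if (__builtin_expect(a == 0, 0))  // branch instead of cmov; cheap if a==0 is rare
        return 0;
    return 1ULL << BSR64(a);          // BSR64 wrapper defined earlier in this answer
}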
A "rm"(a)
input constraint would be possible, but
m
when it's an option, even if that means spilling a value that was already in a register. clang (LLVM) inline assembly - multiple constraints with useless spills / reloadsbsr
has a "false" dependency on existing CPUs, so it's better if the compiler happens to pick the same register for res
and a
instead of writing a new register. This is what usually happens if a
isn't needed later.BSR isn't really a false dependency: if the input is zero, the destination is left unmodified. AMD even documents this, but Intel implements it in their hardware while leaving their documentation saying "undefined" contents.
We can't take advantage of this anyway, we have 65 possible outputs and BTS into a zeroed reg can only produce 64 different outputs.
You might be tempted to use rcl (rotate-through-carry) on a register holding 1, but first of all rcl %cl, %reg is quite slow, and second it still masks the shift count with & 63, so it can't shift a 1 all the way out into CF.
Using BTC and taking advantage of BSR's zero-input behaviour, we can make a version that's possibly better on older Intel P6-family, like Nehalem and Core 2 (where cmov is 2 uops and partial-flag stalls are a problem).
It doesn't save uops, though, because it requires an extra xor-zeroed register. It does shorten the critical-path latency, and could shorten it even more if we used test/setz in parallel with 3-cycle-latency bsr, instead of using the ZF output from BSR. Or if the compiler already did math on the input, it might already have ZF set appropriately; unfortunately there's no way to query that.
btc = complement = flip a bit, instead of unconditionally setting it. This can create a zero result if the destination register is 1 instead of 0, if we know the bit-index for that case.
xor-zero / set flags / setcc is a standard idiom, generally at least as efficient as set flags / setcc / movzx, because xor-zeroing can be even cheaper than movzx and it's off the critical path for latency. setcc is a 1-uop instruction on all x86-64 CPUs.
// depends on BSR behaviour that only AMD documents, but all real CPUs implement (for now?)
// it seems unlikely that this will change, though.
// The main advantage of this version is on P6-family and early Sandybridge
// where it definitely does work safely.
uint64_t LeadingBitIsolate_asm_p6(uint64_t a)
{
    uint64_t btc_target;
    uint64_t bit = 0;
    //bool input_was_zero;
    asm (
        "xor %k[res], %k[res]\n\t" // make sure we avoid P6-family partial-reg stalls with setz + reading full reg by forcing xor-zeroing, not MOV
        "bsr %[a], %[bit] \n\t"    // bit=count or unmodified (if a==0)
        "setz %b[res]\n\t"         // res = (a==0)
        "btc %[bit], %[res] \n\t"  // flip a bit. For a==0, flips bit 0 which SETZ set to 1
        : [bit] "+r" (bit), [res] "=&q"(btc_target) // q = reg accessible as low-8, like RAX has AL. Any x86-64 reg
          // early-clobber: [res] is written before a is read, but [bit] isn't.
          // ,"=@ccz" (input_was_zero) // optional GCC6 flag output. Or "=@ccc" to read from CF (btc result) and avoid partial-flag stalls
        : [a] "r" (a)
        : "cc"
    );
    return btc_target;
    //return input_was_zero;
}
GCC 9 and trunk have a weird register-allocation missed optimization where they produce res in r8 and have to mov it back to RAX.
# gcc8.3 -O3 -mtune=nehalem
LeadingBitIsolate_asm_p6(unsigned long):
xorl %edx, %edx # compiler-generated
xor %eax, %eax # inline asm
bsr %rdi, %rdx
setz %al
btc %rdx, %rax
ret
On Intel CPUs like Core 2, Nehalem, and Sandybridge-family, this is 5 uops with no stalls. BSR has 3-cycle latency; the rest have 1-cycle latency. From RDI input to RAX output, the latency is 3 + 1 + 1 = 5 cycles. (setz has to wait for BSR's output. As I mentioned above, test/setz on a before the bsr would allow instruction-level parallelism, spending an extra uop on test to shorten critical-path latency by another cycle.)
On AMD Bulldozer-family / Ryzen, it's a lot more expensive just because of bsr. The setz is still 1 uop, and the btc is 2 uops.
On Atom/Silvermont, it's also expensive because of BSR.
There's no workaround for slow BSR unless you make a version using lzcnt and do runtime dispatching or something. (Probably for a loop that calls this function; call overhead might still be worse than just using bsr on a CPU with an 8 to 10 uop bsr, especially Ryzen with its high uop throughput.)
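A rough runtime-dispatch sketch (entirely my own assumption about how you might structure it; the helper names are made up, and the CPUID check uses leaf 0x80000001 ECX bit 5, where LZCNT/ABM support is reported):

#include <stdint.h>
#include <cpuid.h>

static uint64_t isolate_generic(uint64_t a) {            // plain BSR-based version
    return a ? 1ULL << (63 - __builtin_clzll(a)) : 0;
}

__attribute__((target("lzcnt")))                          // allow lzcnt in this function only
static uint64_t isolate_lzcnt(uint64_t a) {
    return a ? 1ULL << (63 - __builtin_clzll(a)) : 0;     // should compile to lzcnt here
}

static uint64_t (*isolate_ptr)(uint64_t);

__attribute__((constructor))
static void isolate_init(void) {                          // pick once at program startup
    unsigned eax, ebx, ecx = 0, edx;
    int has_lzcnt = __get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5));
    isolate_ptr = has_lzcnt ? isolate_lzcnt : isolate_generic;
}

The indirect call through isolate_ptr is exactly the per-call overhead the paragraph above warns about, so this only pays off if the caller loops over many inputs.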
The btc version above works fairly obviously for the a != 0 case, doing BTC into a zeroed register to flip the bit at the bit-index found by BSR. For the a==0 case:
# starting: rdx=rax=0
bsr %rdi, %rdx # leaves RDX=0 (unmodified) and sets ZF
setz %al # sets RAX=1
btc %rdx, %rax # flips bit 0 of RAX, changing it back to 0
# rax = 0 (result), rdx = 0 (bitpos)
When I used two "+r" inputs both set to zero in C, GCC decided to xor-zero EAX/RAX, but used a mov %rax, %rdx to copy that to RDX (wasting a REX prefix). That's clearly worse on Sandybridge, which has elimination for xor-zeroing but not mov-elimination. It's better on AMD Ryzen, though, which has mov-elimination but still needs a back-end uop for xor-zeroing.
Copying a 0 with mov vs. xor-zeroing is basically neutral on P6-family (Core 2 / Nehalem), except that only xor-zeroing avoids partial-register stalls for writing AL with setcc and then reading RAX with btc. So I put the xor-zeroing inside the inline asm to make sure it's actually xor-zeroing, not a mov, that gcc picks for res. (Why doesn't GCC use partial registers?)
Note that the BSR behaviour of leaving the destination unmodified for the input=0 case is only documented by AMD, not officially documented by Intel. But it is implemented by all Intel's hardware, at least any that supports 64-bit mode. IDK about Via Nano, or ancient Intel.
Upvotes: 4
Reputation: 7528
While Michael/Florian's fix is probably the easiest, perhaps it's not the best.
Your existing code (modified with 1UL) compiles down to this:
xor eax, eax
test rdi, rdi
je .L1
mov eax, 1
BSR rdi, rcx
sal rax, cl
.L1:
ret
Not bad, but instead of testing for zero, and then calling BSR (which also checks for zero), how about:
#include <stdbool.h>   // for bool when compiling as C

unsigned long int LeadingBit(unsigned long int a) {
    unsigned long int nlb;
    bool z;
    asm (
        "BSR %[a], %[nlb]"
        : [nlb] "=r" (nlb), "=@ccz"(z)
        : [a] "mr" (a)
    );
    unsigned long int one;
    if (!z)
        one = 1;
    else
        one = 0;
    return one << nlb;
}
Since BSR sets ZF to indicate zero, this code uses it to set one to either 0 or 1, depending. The asm for this is pretty clean (gcc 9.2 -O2):
BSR rdi, rcx
setne al
movzx eax, al
sal rax, cl
ret
The "=@ccz" constraint is described in the docs under Flag Output Operands. Basically it's just saying "The value of this variable is taken from the Z (C)ondition (C)ode."
Here's my attempt to translate Peter's comments into code. This may be more complicated than the OP would be interested in, but for the sake of completeness:
unsigned long int LeadingBit(unsigned long int a) {
    unsigned long int bit;
    unsigned long int res;
    asm (
        "BSR %[a], %[bit] \n\t"    // sets both (bit) and ZF
        "BTS %[bit], %[res] \n\t"  // sets the corresponding bit in (res) and
                                   // carry flag, but *doesn't* change ZF
        "CMOVZ %[zero], %[res]"    // reset res to 0 if BSR detected a==0
        : [bit] "=&r" (bit), [res] "=&r"(res)
        : [a] "mr" (a), [zero] "r"(0UL), "[res]"(0UL)
        : "cc"
    );
    return res;
}
While the BSR/BTS/CMOVZ is pretty straightforward, that crap with the constraints might be something that maintainers of your code would struggle with.
So, to explain what's happening:

- "=&r"(res) is going to hold the return value. The & is used to indicate that it can't share a register with any other parameter. To be clear, just because you declare a constraint as "=r", that doesn't mean you will get a unique register, unless you use &. This can be a good thing by reducing the number of registers used, but not so much in this case: if the compiler decided to use the same register for %[zero] and %[res], that would cause the CMOVZ to fail.
- "[res]"(0UL) says that on entry to the asm, whatever register is used for %[res] should be initialized to zero. Yes, we can do this in the asm, but by putting it in the C code, it allows the compiler to do this in whatever manner might be most efficient with respect to the surrounding code. Heck, it might already have a zeroed register laying around that it just used for something else.
- BTS lets you directly set a bit number in a register. It's typically faster than using SAL (that's what Peter was talking about with "3 uops" for SAL vs 1 for BTS), but like BSR, gcc doesn't have an intrinsic for it. And while it does return the existing value of the specified bit in the carry flag (that's what the T part of BTS means), it doesn't modify the Zero flag. This allows us to still do the CMOVZ after the BTS. Updating "part" of the flags register this way can be inefficient on some processors, but not so much with the newer ones.

Here's the output:
xorl %edx, %edx
movq %rdx, %rax
BSR %rdi, %rcx
BTS %rcx, %rax
CMOVZ %rdx, %rax
ret
I'm a little dubious about that movq (why not xor or movl?), but I'm sure there's a good reason. Something to do with 'aliasing' I assume.
If perf was of sufficiently high priority (although OP never said it was), there's one more thing I can think to do. If LeadingBit might ever be called with a constant, the compiler can typically pre-calculate a lot of the math around it at compile time instead of when you run the program.
However, gcc can't precalculate through inline asm. If calling LeadingBit with a constant is a possibility, you could wrap the code I show here with if (__builtin_constant_p(a)) to see whether a is a constant, and use __builtin_clzll et al to calculate it the "slow" way. I say "slow," but calculating the value (even the slow way) at compile time is going to result in a faster executable. And since (by definition) the value of __builtin_constant_p is known at compile time, only one branch of the if will be generated for any invocation of LeadingBit.
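A rough sketch of that wrapping (my interpretation; LeadingBit_asm here is a hypothetical rename of the asm version shown above, and 63 - __builtin_clzll assumes unsigned long is 64-bit, as it is on the OP's target):

unsigned long int LeadingBit(unsigned long int a) {
    if (__builtin_constant_p(a)) {
        // pure C: the compiler can fold this to a constant at compile time
        return a ? 1UL << (63 - __builtin_clzll(a)) : 0;
    } else {
        return LeadingBit_asm(a);   // the BSR/BTS/CMOVZ version from above, renamed
    }
}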
Upvotes: 8
Reputation: 33717
As Michael Petch pointed out, this line is the problem:
return 1<<nlb;
The computation happens as an int, not an unsigned long. Replace it with:
return 1UL<<nlb;
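Putting it together, the whole function with just that one change (assuming unsigned long is 64-bit, as on the OP's target):

unsigned long int LeadingBit(unsigned long int a) {
    if(a==0)
        return 0;
    unsigned long int nlb;
    asm (
        "BSR %1, %0 \n"
        : "=r" (nlb)
        : "mr" (a)
        : "cc"
    );
    return 1UL<<nlb;   // 1UL forces a 64-bit shift
}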
Upvotes: 5