Reputation: 89
I am trying to find the leading bit of an unsigned 64-bit int. I am using BSR because my processor doesn't have the LZCNT instruction. My problem is that once the input reaches exactly 2^32, the returned leading-bit value wraps around and cycles back through the low outputs again for inputs all the way up to 2^64.
This is my code:
unsigned long int LeadingBit(unsigned long int a) {
    if(a==0)
        return 0;
    unsigned long int nlb;
    asm (
        "BSR %1, %0 \n"
        : "=r" (nlb)
        : "mr" (a)
        : "cc"
    );
    return 1<<nlb;
}
The intention of this code is to take a 64-bit integer as input and return the value of the position of the leading 1. e.g. a = 65 (1000001) returns 1000000.
Upvotes: 6
Views: 4039
Reputation: 365277
TL:DR: you should only use inline asm if you're going to optimize the whole thing and beat the compiler. Otherwise use intrinsics like __builtin_clzll or GCC-only __builtin_ia32_bsrdi.
Your actual problem was unrelated to inline asm.
1 has type int, and unlike most operators, << doesn't promote the left side to match the right side. So you're shifting an int by more than 31 bits, which is undefined behaviour. In practice you get a 32-bit operand-size shift, which you could have noticed by looking at the compiler's asm output. (Generally a good idea whenever you're using inline asm.)
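A minimal standalone illustration of the width problem (my own toy example, not the OP's code; the volatile just keeps the compiler from folding the undefined shift away at compile time):

#include <stdio.h>

int main(void) {
    volatile unsigned nlb = 40;             // e.g. a bit index from a 64-bit BSR
    unsigned long long bad  = 1 << nlb;     // undefined: 32-bit int shifted by 40
    unsigned long long good = 1ULL << nlb;  // well-defined 64-bit shift
    // On x86 the 32-bit shift masks the count to 40 & 31 = 8,
    // so this typically prints "100 10000000000".
    printf("%llx %llx\n", bad, good);
    return 0;
}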
You don't need inline asm for this; it's easy to express using GNU C builtins.
But if performance is critical, you might want to use inline asm to work around compiler missed-optimizations for current microarchitectures in the shift part, especially if you're compiling without BMI2 for efficient variable-count shifts on Intel CPUs.
Note that bsr is quite slow on AMD CPUs while lzcnt is fast on all CPUs that support it. On Intel, both are fast. But unlike bsf/tzcnt, the two instructions produce different results even for non-zero inputs, so using rep bsr to get a fast lzcnt on newer CPUs wouldn't be useful the way it sometimes is for tzcnt. (tzcnt runs as bsf on older CPUs, and lzcnt runs as bsr on older CPUs, because the encoding of lzcnt is a REP prefix in front of bsr.)
Unfortunately bsr is quite slow on Atom/Silvermont, too. For example on Goldmont Plus: 11 uops, 9 cycle latency, and 8 cycle throughput.
See my answer on Find most significant bit (left-most) that is set in a bit array for a rundown of how various compilers are dumb with 63-__builtin_clzll(x), which could optimize back to just a bsr but doesn't.
GCC specifically (not clang) has an intrinsic __builtin_ia32_bsrdi for 64-bit bsr, and both support _bit_scan_reverse for 32-bit.
#include <stdint.h>
#include <limits.h>
#include <assert.h>   // static_assert is a macro here when compiling as C11

uint64_t LeadingBitIsolate_gnuc(uint64_t a)
{
    static_assert( CHAR_BIT * sizeof(unsigned long long) == 64, "__builtin_clzll isn't 64-bit operand size");
    uint64_t bit = 63-__builtin_clzll(a); // BSR
    return a ? (1ULL << bit) : 0; // ULL is guaranteed to be at least a 64-bit type
}
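A quick sanity check (a hypothetical test harness I'm adding, not part of the answer) using the question's own a = 65 example:

#include <stdio.h>

int main(void) {
    printf("%llu\n", (unsigned long long)LeadingBitIsolate_gnuc(65));         // expect 64
    printf("%llu\n", (unsigned long long)LeadingBitIsolate_gnuc(0));          // expect 0
    printf("%llx\n", (unsigned long long)LeadingBitIsolate_gnuc(1ULL << 63)); // expect 8000000000000000
    return 0;
}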
If a is ever a compile-time constant, constant-propagation through this function works. Using inline asm always defeats that unless you use something like if(__builtin_constant_p(a)) { non-asm version } else { asm version }. (https://gcc.gnu.org/wiki/DontUseInlineAsm)
I used uint64_t for portability to x86-64 targets where unsigned long is a 32-bit type (Linux x32 (an ILP32 ABI), and MS Windows x64), and also to non-x86 targets, since this doesn't use any inline asm.
Unfortunately this compiles to pretty bad asm with gcc and clang (Godbolt)
# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_gnuc(unsigned long):
movq %rdi, %rax
testq %rdi, %rdi
je .L4
bsrq %rdi, %rax
movl $63, %ecx
xorq $63, %rax # bit = 63 - bit in the low 6 bits, 2's complement bithack
subl %eax, %ecx # bit = 63 - bit
movl $1, %eax # 1 uop
salq %cl, %rax # 3 uops
.L4:
ret
Using BSR64 = __builtin_ia32_bsrdi from my answer on Find most significant bit (left-most) that is set in a bit array, we get this from GCC (and similar using cmov from clang and ICC). ICC/MSVC provide intrinsics that return a bool for the ZF output from BSR. test/je has the same cost as je on modern x86 (they macro-fuse), but saving the test instruction in front of cmov does have value.
// See the linked answer for the ICC/MSVC part of this
#ifdef __GNUC__
  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
        // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
        // update: clang8.0 regressed here but still doesn't do __builtin_ia32_bsrdi
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

  #include <x86intrin.h>
  #define BSR32(x) _bit_scan_reverse(x)
#endif
uint64_t LeadingBitIsolate_gnuc_v2(uint64_t a)
{
    uint64_t bit = BSR64(a);
    return a ? (1ULL << bit) : 0;
}
This compiles better with GCC, at least for the BSR part. But still not for the shift part.
# gcc -O3 -mtune=skylake
LeadingBitIsolate_gnuc_v2(unsigned long):
movq %rdi, %rax
testq %rdi, %rdi
je .L9
bsrq %rdi, %rcx
movl $1, %eax
salq %cl, %rax # missed optimization: xor-zero + bts would be much better especially with tune=skylake
.L9:
ret
On the plus side here, we've optimized out most of the BSR overhead by using a BSR intrinsic. And it can still compile to LZCNT on CPUs where that's a better choice, e.g. with clang -march=znver1 (Ryzen). GCC still uses BSR, but clang uses 63-lzcnt, which will run faster on AMD.
63-__builtin_clzll is portable to non-x86 but __builtin_ia32_bsrdi isn't. (DI = double-word integer = 64-bit.)
But the compiler does still know what's happening, and can optimize this into surrounding code, and for compile-time-constant inputs.
E.g. it can branch over other things, too, if the input=0, only testing that once. And it knows that the result has only a single bit set, so in theory if you're using it as a divisor a compiler could be smart enough to AND with res-1 instead of using a slow div instruction. (Division by zero is UB so the compiler can assume that didn't happen, maybe even optimizing away the a==0 ? part of this after inlining.)
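For instance, something like this (my own sketch, not from the answer) could in principle compile to an and instead of a div, because x % p == x & (p-1) when p is a power of 2:

uint64_t mod_by_leading_bit(uint64_t x, uint64_t y)   // hypothetical helper
{
    uint64_t res = LeadingBitIsolate_gnuc(y);  // 0 or a single set bit
    // res == 0 would be division by zero (UB), so the compiler may assume res != 0
    return x % res;                            // could become x & (res - 1)
}

Whether a given compiler actually does this after inlining isn't guaranteed; the point is only that the pure-C version leaves the door open, while inline asm hides the single-bit property from the optimizer.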
Compiling this with BMI2 available for SHLX would make it efficient on modern Intel. SnB-family has slower-than-usual variable-count shifts: instead of potential stalls if the flag-result is read, it includes 2 extra uops to check if the count is zero and conditionally merge the shift's flag-result into FLAGS. (shl %cl, %reg has to leave FLAGS unmodified if cl=0. This is the kind of legacy baggage that's part of the "x86 tax" that high-performance x86 CPUs have to pay to do superscalar out-of-order execution.)
AMD apparently manages shifts as a single uop without limitations / penalties.
Thus bts into a zeroed register is a better choice for implementing 1ULL << count, especially on Intel CPUs where xor-zeroing is extra cheap, and bts reg,reg is a single uop (instead of 2 on Ryzen).
Compared to the version in @DavidWohlferd's answer (based on my comments), this saves instructions by using a as the zero in the a==0 case, instead of needing an extra zeroed register. And the comments talk about performance implications.
#include <stdint.h>
#include <limits.h>

uint64_t LeadingBitIsolate_asm(uint64_t a)
{
    uint64_t bit;
    uint64_t bts_target = 0;
    asm (
        "BSR %[a], %[bit] \n\t"    // ZF = (a==0)
        "BTS %[bit], %[res] \n\t"  // sets CF = old bit = 0 but not ZF
        // possible partial-flag stall on some CPUs for reading ZF after writing CF
        // but not SnB-family
        "CMOVZ %[a], %[res]"       // res = a (= 0) if BSR detected a==0
        : [bit] "=r" (bit), [res] "+r"(bts_target) // no early clobber: input read before either output written
        : [a] "r" (a)
        : "cc" // redundant but not harmful on x86: flags clobber is always implied
    );
    return bts_target;
}
Reading ZF after BTS writes CF is dangerous for older Intel CPUs, where it will cause a partial-flag stall. See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs.
GCC 9.2 gives us a sequence that's only 4 uops on Skylake (https://agner.org/optimize/). The xor-zeroing needs a front-end uop, but on Sandybridge-family it doesn't need a back-end execution unit (0 unfused-domain uops).
It's 5 uops on Haswell and earlier, where cmov is 2 uops / 2c latency. Since you say your CPU doesn't support lzcnt, you might have an IvyBridge or earlier (2-uop cmov), or an old AMD.
Branching might be worth it if input=0 is expected to be rare or never happen, especially if this is part of a critical path for latency, and especially on older Intel where cmov is 2 uops; see the sketch after the compiler output below.
# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_asm(unsigned long):
xorl %eax, %eax
BSR %rdi, %rdi
BTS %rdi, %rax
CMOVZ %rdi, %rax
ret
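For completeness, a branchy variant could be as simple as the following sketch (my own addition, using the BSR64 wrapper from earlier; __builtin_expect is only a hint, and the 1ULL << bit shift is still subject to the variable-count-shift discussion above):

uint64_t LeadingBitIsolate_branchy(uint64_t a)
{
    if (__builtin_expect(a == 0, 0))  // branch instead of cmov; cheap if a==0 is rare
        return 0;
    return 1ULL << BSR64(a);          // BSR64 wrapper defined earlier in this answer
}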
A "rm"(a)
input constraint would be possible, but
m
when it's an option, even if that means spilling a value that was already in a register. clang (LLVM) inline assembly - multiple constraints with useless spills / reloadsbsr
has a "false" dependency on existing CPUs, so it's better if the compiler happens to pick the same register for res
and a
instead of writing a new register. This is what usually happens if a
isn't needed later.BSR isn't really a false dependency: if the input is zero, the destination is left unmodified. AMD even documents this, but Intel implements it in their hardware while leaving their documentation saying "undefined" contents.
We can't take advantage of this anyway, we have 65 possible outputs and BTS into a zeroed reg can only produce 64 different outputs.
You might be tempted to use rcl (rotate-through-carry) on a register holding 1, but first of all rcl %cl, %reg is quite slow, and second it still masks the shift count with & 63, so it can't shift a 1 all the way out into CF.
Using BTC and taking advantage of BSR's zero-input behaviour, we can make a version that's possibly better on older Intel P6-family, like Nehalem and Core 2 (where cmov is 2 uops and partial-flag stalls are a problem).
It doesn't save uops, though, because it requires an extra xor-zeroed register. It does shorten the critical-path latency, and could shorten it even more if we used test/setz in parallel with 3-cycle-latency bsr, instead of using the ZF output from BSR. Or if the compiler already did math on the input, it might already have ZF set appropriately; unfortunately there's no way to query that.
btc = complement = flip a bit, instead of unconditionally setting it. This can create a zero result if the destination register is 1 instead of 0, if we know the bit-index for that case.
xor-zero / set flags / setcc is a standard idiom, generally at least as efficient as set flags / setcc / movzx, because xor-zeroing can be even cheaper than movzx and it's off the critical path for latency. setcc is a 1-uop instruction on all x86-64 CPUs.
// depends on BSR behaviour that only AMD documents, but all real CPUs implement (for now?)
// it seems unlikely that this will change, though.
// The main advantage of this version is on P6-family and early Sandybridge
// where it definitely does work safely.
uint64_t LeadingBitIsolate_asm_p6(uint64_t a)
{
    uint64_t btc_target;
    uint64_t bit = 0;
    //bool input_was_zero;
    asm (
        "xor %k[res], %k[res]\n\t" // make sure we avoid P6-family partial-reg stalls with setz + reading full reg by forcing xor-zeroing, not MOV
        "bsr %[a], %[bit] \n\t"    // bit=count or unmodified (if a==0)
        "setz %b[res]\n\t"         // res = (a==0)
        "btc %[bit], %[res] \n\t"  // flip a bit. For a==0, flips bit 0 which SETZ set to 1
        : [bit] "+r" (bit), [res] "=&q"(btc_target) // q = reg accessible as low-8, like RAX has AL. Any x86-64 reg
          // early-clobber: [res] is written before a is read, but [bit] isn't.
          // ,"=@ccz" (input_was_zero) // optional GCC6 flag output. Or "=@ccc" to read from CF (btc result) and avoid partial-flag stalls
        : [a] "r" (a)
        : "cc"
    );
    return btc_target;
    //return input_was_zero;
}
GCC 9 and trunk have a weird register-allocation missed optimization where they produce res in r8 and have to mov it back to RAX.
# gcc8.3 -O3 -mtune=nehalem
LeadingBitIsolate_asm_p6(unsigned long):
xorl %edx, %edx # compiler-generated
xor %eax, %eax # inline asm
bsr %rdi, %rdx
setz %al
btc %rdx, %rax
ret
On Intel CPUs like Core 2, Nehalem, and Sandybridge-family, this is 5 uops with no stalls. BSR has 3-cycle latency; the rest have 1-cycle latency. From RDI input to RAX output, the latency is 3 + 1 + 1 = 5 cycles. (setz has to wait for BSR's output. As I mentioned above, test/setz on a before the bsr would allow instruction-level parallelism, spending an extra uop on test to shorten critical-path latency by another cycle.)
On AMD Bulldozer-family / Ryzen, it's a lot more expensive just because of bsr. The setz is still 1 uop, and the btc is 2 uops.
On Atom/Silvermont, it's also expensive because of BSR.
There's no workaround for slow BSR unless you make a version using lzcnt and do runtime dispatching or something. (Probably for a loop that calls this function; call overhead might still be worse than just using bsr on a CPU with an 8 to 10 uop bsr, especially Ryzen with its high uop throughput.)
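A rough runtime-dispatch sketch (entirely my own assumption about how you might structure it; the helper names are made up, and the CPUID check uses leaf 0x80000001 ECX bit 5, where LZCNT/ABM support is reported):

#include <stdint.h>
#include <cpuid.h>

static uint64_t isolate_generic(uint64_t a) {            // plain BSR-based version
    return a ? 1ULL << (63 - __builtin_clzll(a)) : 0;
}

__attribute__((target("lzcnt")))                          // allow lzcnt in this function only
static uint64_t isolate_lzcnt(uint64_t a) {
    return a ? 1ULL << (63 - __builtin_clzll(a)) : 0;     // should compile to lzcnt here
}

static uint64_t (*isolate_ptr)(uint64_t);

__attribute__((constructor))
static void isolate_init(void) {                          // pick once at program startup
    unsigned eax, ebx, ecx = 0, edx;
    int has_lzcnt = __get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5));
    isolate_ptr = has_lzcnt ? isolate_lzcnt : isolate_generic;
}

The indirect call through isolate_ptr is exactly the per-call overhead the paragraph above warns about, so this only pays off if the caller loops over many inputs.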
The btc version above works fairly obviously for the a != 0 case, doing BTC into a zeroed register to flip the bit at the bit-index found by BSR. For the a==0 case:
# starting: rdx=rax=0
bsr %rdi, %rdx # leaves RDX=0 (unmodified) and sets ZF
setz %al # sets RAX=1
btc %rdx, %rax # flips bit 0 of RAX, changing it back to 0
# rax = 0 (result), rdx = 0 (bitpos)
When I used two "+r" inputs both set to zero in C, GCC decided to xor-zero EAX/RAX, but used a mov %rax, %rdx to copy that to RDX (wasting a REX prefix). That's clearly worse on Sandybridge, which has elimination for xor-zeroing but not mov-elimination. It's better on AMD Ryzen, though, which has mov-elimination but still needs a back-end uop for xor-zeroing.
Copying a 0 with mov vs. xor-zeroing is basically neutral on P6-family (Core 2 / Nehalem), except that only xor-zeroing avoids partial-register stalls for writing AL with setcc and then reading RAX with btc. So I put the xor-zeroing inside the inline asm to make sure it's actually xor-zeroing, not a mov, that gcc picks for res. (Why doesn't GCC use partial registers?)
Note that the BSR behaviour of leaving the destination unmodified for the input=0 case is only documented by AMD, not officially documented by Intel. But it is implemented by all Intel's hardware, at least any that supports 64-bit mode. IDK about Via Nano, or ancient Intel.
Upvotes: 4
Reputation: 7528
While Michael/Florian's fix is probably the easiest, perhaps it's not the best.
Your existing code (modified with 1UL) compiles down to this:
xor eax, eax
test rdi, rdi
je .L1
mov eax, 1
BSR rdi, rcx
sal rax, cl
.L1:
ret
Not bad, but instead of testing for zero, and then calling BSR (which also checks for zero), how about:
#include <stdbool.h>   // for bool when compiling as C

unsigned long int LeadingBit(unsigned long int a) {
    unsigned long int nlb;
    bool z;
    asm (
        "BSR %[a], %[nlb]"
        : [nlb] "=r" (nlb), "=@ccz"(z)
        : [a] "mr" (a)
    );
    unsigned long int one;
    if (!z)
        one = 1;
    else
        one = 0;
    return one << nlb;
}
Since BSR sets ZF to indicate zero, this code uses it to set one to either 0 or 1, depending. The asm for this is pretty clean (gcc 9.2 -O2):
BSR rdi, rcx
setne al
movzx eax, al
sal rax, cl
ret
The "=@ccz" constraint is described in the docs under Flag Output Operands. Basically it's just saying "The value of this variable is taken from the Z (C)ondition (C)ode."
Here's my attempt to translate Peter's comments into code. This may be more complicated than the OP would be interested in, but for the sake of completeness:
unsigned long int LeadingBit(unsigned long int a) {
    unsigned long int bit;
    unsigned long int res;
    asm (
        "BSR %[a], %[bit] \n\t"    // sets both (bit) and ZF
        "BTS %[bit], %[res] \n\t"  // sets the corresponding bit in (res) and
                                   // carry flag, but *doesn't* change ZF
        "CMOVZ %[zero], %[res]"    // reset res to 0 if BSR detected a==0
        : [bit] "=&r" (bit), [res] "=&r"(res)
        : [a] "mr" (a), [zero] "r"(0UL), "[res]"(0UL)
        : "cc"
    );
    return res;
}
While the BSR/BTS/CMOVZ is pretty straightforward, that crap with the constraints might be something that maintainers of your code would struggle with.
So, to explain what's happening:

- "=&r"(res) is going to hold the return value. The & is used to indicate that it can't share a register with any other parameter. To be clear, just because you declare a constraint as "=r", that doesn't mean you will get a unique register, unless you use &. This can be a good thing by reducing the number of registers used, but not so much in this case: if the compiler decided to use the same register for %[zero] and %[res], that would cause the CMOVZ to fail.
- "[res]"(0UL) says that on entry to the asm, whatever register is used for %[res] should be initialized to zero. Yes, we can do this in the asm, but by putting it in the C code, it allows the compiler to do this in whatever manner might be most efficient with respect to the surrounding code. Heck, it might already have a zeroed register laying around that it just used for something else.
- BTS lets you directly set a bit number in a register. It's typically faster than using SAL (that's what Peter was talking about with "3 uops" for SAL vs 1 for BTS), but like BSR, gcc doesn't have an intrinsic for it. And while it does return the existing value of the specified bit in the carry flag (that's what the T part of BTS means), it doesn't modify the Zero flag. This allows us to still do the CMOVZ after the BTS. Updating "part" of the flags register this way can be inefficient on some processors, but not so much with the newer ones.

Here's the output:
xorl %edx, %edx
movq %rdx, %rax
BSR %rdi, %rcx
BTS %rcx, %rax
CMOVZ %rdx, %rax
ret
I'm a little dubious about that movq (why not xor or movl?), but I'm sure there's a good reason. Something to do with 'aliasing' I assume.
If perf was of sufficiently high priority (although OP never said it was), there's one more thing I can think to do. If LeadingBit might ever be called with a constant, the compiler can typically pre-calculate a lot of the math around it at compile time instead of when you run the program.
However, gcc can't precalculate through inline asm. If calling LeadingBit with a constant is a possibility, you could wrap the code I show here with if (__builtin_constant_p(a)) to see whether a is a constant, and use __builtin_clzll et al to calculate it the "slow" way. I say "slow," but calculating the value (even the slow way) at compile time is going to result in a faster executable. And since (by definition) the value of __builtin_constant_p is known at compile time, only one branch of the if will be generated for any invocation of LeadingBit.
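A rough sketch of that wrapping (my interpretation; LeadingBit_asm here is a hypothetical rename of the asm version shown above, and 63 - __builtin_clzll assumes unsigned long is 64-bit, as it is on the OP's target):

unsigned long int LeadingBit(unsigned long int a) {
    if (__builtin_constant_p(a)) {
        // pure C: the compiler can fold this to a constant at compile time
        return a ? 1UL << (63 - __builtin_clzll(a)) : 0;
    } else {
        return LeadingBit_asm(a);   // the BSR/BTS/CMOVZ version from above, renamed
    }
}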
Upvotes: 8
Reputation: 33717
As Michael Petch pointed out, this line is the problem:
return 1<<nlb;
The computation happens as an int, not an unsigned long. Replace it with:
return 1UL<<nlb;
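Putting it together, the whole function with just that one change (assuming unsigned long is 64-bit, as on the OP's target):

unsigned long int LeadingBit(unsigned long int a) {
    if(a==0)
        return 0;
    unsigned long int nlb;
    asm (
        "BSR %1, %0 \n"
        : "=r" (nlb)
        : "mr" (a)
        : "cc"
    );
    return 1UL<<nlb;   // 1UL forces a 64-bit shift
}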
Upvotes: 5