Reputation: 197
I am building a Huffman prefix code LUT by traversing the tree. I am using a register to keep track of the current prefix.
My condition for exiting the tree-parsing algorithm requires both of the following to be true:
the current element in the tree is a leaf
the current prefix code has all of its bits set, with no unset bits in between.
For the second condition, I am also keeping track of the current length (in bits) of the prefix string.
How can I test that a register has more than 1 bit set and that all set bits are adjacent to each other?
EDIT: The group of set bits must start at bit 0 and be as long as the prefix length (stored in another register)
Upvotes: 2
Views: 470
Reputation: 365832
The building block for this is going to be: adding 1 at the low end of a contiguous group of set bits clears all those bits via carry-out, leaving 1 bit set just above the group, e.g. 0xff + 1 = 0x100. If any bits are unset, the carry won't propagate all the way up, e.g. 0b01101 + 1 = 0b01110, not setting bit #4 (and leaving some of the existing set bits un-flipped, so x & (x+1) will be true.)
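A quick way to see that carry behaviour in C, using the values from the examples above (a standalone demo, not part of the answer's asm):

#include <stdio.h>

int main(void)
{
    printf("%#x\n", 0xffu + 1u);             // 0x100: carry propagates through the whole group
    printf("%#x\n", 0x0du + 1u);             // 0b01101 + 1 = 0b01110 (0xe): the unset bit stops the carry
    printf("%#x\n", 0x0du & (0x0du + 1u));   // 0xc, non-zero: some set bits were left un-flipped
    return 0;
}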
This works for bit-groups at the bottom of a register (adding 1), or at any higher position (isolate the lowest set bit with (-x) & x and add that, e.g. with BMI1 blsi or mov/neg/and).
A related bithack is the y & (y-1) test for an integer having only a single bit set (by clearing the lowest set bit): https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2. But since we're producing y with x+1, we can optimize it down to just x & (x+1) to detect a contiguous mask at the bottom of a register.
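In C, that optimized check is a one-liner (helper name is mine; note it also accepts 0 and 1, as discussed further down):

#include <stdbool.h>

// true when the set bits of x (if any) form one contiguous group starting at bit 0
static bool low_mask(unsigned x)
{
    return (x & (x + 1)) == 0;   // adding 1 must flip every set bit
}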
Your case, per the question's edit: the group of set bits starts at bit 0 and is n bits wide (the prefix-length).

Those restrictions mean there's exactly 1 integer that matches both requirements, so you should pre-compute it and simply compare for equality with cmp/je. The number with n bits set at the bottom is prefix_mask = (1<<n) - 1: the borrow from the subtraction sets all the bits below that isolated high bit and clears the bit itself, and bits above it stay unset because that high bit satisfied the borrow.
Given the prefix-length n, you can compute 1<<n using bts (which is single-uop on Intel CPUs, or 2 uops on AMD, https://agner.org/optimize/).
;; input: prefix length in EDX
;; output: prefix mask in ESI
xor esi, esi
bts esi, edx ; set bit n: esi |= 1<<n
dec esi ; set all the bits BELOW that power of 2
; then later, inside a loop: input in EAX
cmp eax, esi
je all_bits_set_up_to_prefix
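The same precompute-and-compare, sketched in C (names are mine; assumes 1 <= n <= 31 so the shift doesn't overflow):

#include <stdbool.h>

static unsigned prefix_mask(unsigned n)
{
    return (1u << n) - 1u;          // n low bits set, same as the bts/dec above
}

static bool is_full_prefix(unsigned x, unsigned mask)
{
    return x == mask;               // the cmp/je: only one value matches
}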
@fuz proposed a LUT for this but that sounds like a bad idea even if you have to consider different prefix lengths very frequently. If you're short on registers, you can spill the mask to stack memory after computing it and use something like cmp [rsp+16], edx instead of a static LUT while looping with the same prefix length.
Or spill the prefix length to memory if you don't need it in a register for a while, and just keep the mask.
You can even translate a mask back to a prefix length with lea edx, [eax+1] / bsr edx, edx to find the bit-index of the highest set bit of mask+1. (Or if a prefix-length of 32 is possible but zero isn't, then bsr / inc. BSR with input=0 leaves the destination unmodified and sets ZF; AMD documents this, and Intel's docs say "undefined" but their current HW does leave the destination unmodified, which is why the instruction has a "false" dependency on the output.)
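In C this translation might look like the following (my naming; assumes the mask really is a contiguous low-bit mask and not all-ones, so mask+1 is a single nonzero power of 2). __builtin_ctz is the GCC/Clang builtin; since mask+1 has only one bit set, its trailing-zero count equals the bit index bsr would return:

static unsigned mask_to_len(unsigned mask)
{
    return (unsigned)__builtin_ctz(mask + 1);   // index of the single set bit in mask+1
}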
Non-destructive test of EDX for the low n bits being all-ones and bit #n itself being 0. (Adding 1 clears the low bits and sets bit n if that was the case.) You can use inc edx instead of the LEA copy-and-add if you don't need the original value afterwards.
;;; check the low n bits, ignoring whether higher bits are set or not
;;; inputs: prefix length in ECX, candidate in EDX
lea eax, [rdx+1]
bt eax, ecx
;;; output: CF = 1 if bits 0..len-1 were all set and bit #len was clear
;;; (CF can also end up 1 if bit #len was set above a hole; the btr version below rules that out)
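Expressed in C (my naming; n <= 31 assumed), the pair of instructions just reads bit n of x+1:

#include <stdbool.h>

static bool low_n_ones_bit_n_clear(unsigned x, unsigned n)
{
    // set when bits 0..n-1 of x were all ones and bit n was clear;
    // also set, spuriously, if bit n of x was 1 above a missing lower bit
    return ((x + 1) >> n) & 1;
}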
If you also want to rule out any higher bits being set you need one more instruction, but it can be a test instruction that will macro-fuse with a jcc, so on Intel CPUs this doesn't cost any extra uops. On AMD CPUs, where btr is 2 uops vs. bt being 1, this costs 1 extra uop. (test/jcc can fuse on AMD Bulldozer-family and later.)
;;; input: prefix length in ECX, candidate in EDX
lea eax, [rdx+1] ; if EDX was a contiguous low mask, this produces a single set bit
btr eax, ecx ; reset that bit, leaving eax=0 if no other bits were set
test eax, eax ; compare against zero
;;; output: ZF=1 (and eax=0) if EDX == (1<<ECX)-1 with no higher bits set.
jz contiguous_bitmask_of_length_ecx
This costs 3 uops total on Intel (4 on AMD), including the macro-fused test/jz, to branch on this condition. And it doesn't destroy the input register.
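In C (my naming; n <= 31 assumed), this amounts to checking that x+1 is exactly 1<<n:

#include <stdbool.h>

static bool exact_low_mask(unsigned x, unsigned n)
{
    // clear bit n of x+1 and check that nothing is left
    // (x = 0xFFFFFFFF wraps x+1 to 0 and passes for any n, like the asm)
    return ((x + 1) & ~(1u << n)) == 0;
}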
We can check for a single contiguous bit-group of unknown length at the bottom of a register with x & (x+1), which does detect if any higher bits were set. If there's a high bit that isn't flipped by carry-propagation, the AND or TEST will produce a non-zero result. But this treats 0 and 1 the same as multi-bit groups like 0b0111. You might want cmp eax, 3 / jb not_multibit_prefix before this test.
; check for a contiguous bit-group at the bottom of a reg of arbitrary length, including 0
;;; input: candidate in EDX
lea eax, [rdx+1] ; carry-out clears all the set bits at the bottom
test eax, edx ; set flags from x & (x+1)
;;; output: ZF=1 if the only set bits in EDX were a contiguous group at the bottom
I looked at a weird partial-flags hack of lea eax, [rdx+1] / test eax, edx (ZF=1: only contiguous low bits were set) / bt eax, ecx (CF=1: it ended at the position we want). But x86 doesn't have a jcc condition that requires CF=1 and ZF=1. ja is taken if (CF=0 and ZF=0), jbe is taken if (CF=1 or ZF=1), so neither works. And of course this would be horrible on CPUs without efficient partial-flag merging, causing a partial-flag stall.
The general case is a single contiguous bit-group of unknown length at an unknown position in the register. That rules out simple pre-computing.
As mentioned above, we can isolate the lowest set bit with (-x) & x. BMI1 added an instruction for that, blsi, so if you can assume BMI1 support you can do this non-destructively in 1 uop. Otherwise it takes 3.
unsigned bits_contiguous(unsigned x) {
    unsigned lowest_set = (-x) & x;  // isolate the lowest set bit
    unsigned add = x + lowest_set;   // carry flips all the contiguous set bits
    return (add & x) == 0;           // did add+carry leave any bits un-flipped?
}
I put this on the Godbolt compiler explorer to see if gcc or clang spotted any optimizations I didn't think of. Of course you don't actually want to materialize a 0 / 1 integer like we're asking the compiler to do, but since they choose to use test / setcc we can just look at what they do to create the right flag condition.

We can write some test functions to make sure the logic is correct for some compile-time constants with #define TEST(n) int test##n(){return bits_contiguous(n);} (and see if the asm is xor eax,eax or mov eax,1). See C + asm on the Godbolt compiler explorer. Some interesting cases are TEST(0) = 1, because the condition basically checks for there being multiple bit-groups (so zero bit-groups is the same as 1 bit-group, for this check), and TEST(0xFFFFFFFF) = 1 as well: having x+1 = 0 isn't a problem.
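For reference, a small self-check of those cases with plain asserts instead of the TEST macro (assumes the bits_contiguous definition above is in scope):

#include <assert.h>

int main(void)
{
    assert( bits_contiguous(0));            // zero bit-groups counts the same as one group
    assert( bits_contiguous(7));            // 0b0111: one group at the bottom
    assert( bits_contiguous(6));            // 0b0110: one group, not at the bottom
    assert( bits_contiguous(0xFFFFFFFFu));  // x+1 wrapping to 0 isn't a problem
    assert(!bits_contiguous(13));           // 0b01101: the hole splits it into two groups
    return 0;
}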
With gcc8.3 -O3, we get
# gcc8.3 -O3 -march=haswell (enables BMI1 and BMI2)
bits_contiguous:
blsi eax, edi
add eax, edi
test eax, edi # x & (x+lowest_set)
sete al
movzx eax, al
ret
Without BMI1, we need 3 instructions instead of 1 for blsi:
mov eax, edi
neg eax
and eax, edi # eax = isolate_lowest(x)
add eax, edi
test eax, edi
To also check for a specific length of the bit-group, @fuz had a good idea: popcnt to make sure the right number of bits were set (and separately check that they're contiguous). popcnt is not baseline, though; CPUs before Nehalem don't have it and will fault if they try to run it.
;input: prefix len in ECX, candidate in EDX
;clobbers: EAX
popcnt eax, edx
cmp eax, ecx
jne wrong_number_of_bits_set ; skip the contiguousness test
blsi eax, edx
add eax, edx
test eax, edx ; x & (x+lowest_set)
jz contiguous_bitgroup_of_length_ecx
wrong_number_of_bits_set:
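The same combination in C might look like this (my naming; __builtin_popcount is the GCC/Clang builtin, which compiles to the popcnt instruction when the target supports it):

#include <stdbool.h>

static bool contiguous_with_length(unsigned x, unsigned n)
{
    unsigned lowest_set = x & -x;                    // isolate the lowest set bit (blsi)
    bool contiguous = ((x + lowest_set) & x) == 0;   // same test as bits_contiguous above
    return __builtin_popcount(x) == (int)n && contiguous;
}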
Upvotes: 2
Reputation: 93162
Let your number be in eax and the desired prefix length be in edx.
First of all, to get the number of trailing ones, compute the complement and then count its trailing zeroes (we branch to fail if the number does not match). The cmovz is needed because bsf doesn't like being invoked with 0 as its argument; if the all-ones case cannot occur, you can drop the mov ebx, 32 and the cmovz.
mov ebx, 32 ; number of bits in an int
mov ecx, eax
not ecx
bsf ecx, ecx ; count trailing zeroes in ecx
; i.e. trailing ones in eax
cmovz ecx, ebx ; give correct result for eax == -1
cmp ecx, edx
jnz fail ; have as many ones as desired?
If your CPU has tzcnt, you can avoid the mov ebx, 32 and the cmovz, since tzcnt is defined to return the operand size (32) for an input of zero:
mov ecx, eax
not ecx
tzcnt ecx, ecx
cmp ecx, edx
jnz fail
Note that on CPUs that do not have it, tzcnt is silently interpreted as bsf, breaking your code; watching for an undefined-instruction exception is insufficient to make sure it's present.
To make sure that no other bits are set, clear out the suffix and check if the result is zero:
lea ecx, [rax+1]
and ecx, eax ; ecx = eax & eax + 1
; i.e. clear all trailing 1 bits
jnz fail ; fail if any bits are left
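Putting this answer's two checks together in C (my naming; __builtin_ctz is the GCC/Clang builtin and is undefined for 0, so the all-ones case is handled explicitly, like the cmovz above):

#include <stdbool.h>

static bool matches_prefix_len(unsigned x, unsigned len)
{
    // trailing ones of x = trailing zeroes of ~x; ~x == 0 means 32 trailing ones
    unsigned trailing_ones = (~x == 0u) ? 32u : (unsigned)__builtin_ctz(~x);
    if (trailing_ones != len)
        return false;                // wrong number of trailing 1 bits
    return (x & (x + 1)) == 0;       // no bits set above the suffix
}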
Though the fastest implementation would probably be to just make a lookup table of all suffix lengths and then check whether your suffix matches:
lut: dd 00000000h, 00000001h, 00000003h, 00000007h
dd 0000000fh, 0000001fh, 0000003fh, 0000007fh
dd 000000ffh, 000001ffh, 000003ffh, 000007ffh
dd 00000fffh, 00001fffh, 00003fffh, 00007fffh
dd 0000ffffh, 0001ffffh, 0003ffffh, 0007ffffh
dd 000fffffh, 001fffffh, 003fffffh, 007fffffh
dd 00ffffffh, 01ffffffh, 03ffffffh, 07ffffffh
dd 0fffffffh, 1fffffffh, 3fffffffh, 7fffffffh
dd 0ffffffffh
...
cmp lut[edx * 4], eax
jnz fail
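A C sketch of the same table approach (names are mine): fill the 33 masks once, then the per-candidate check is a load and a compare.

#include <stdbool.h>

static unsigned lut[33];                 // lut[n] = n low bits set, as in the dd table above

static void init_lut(void)
{
    for (unsigned n = 0; n <= 32; n++)
        lut[n] = (n == 32) ? 0xFFFFFFFFu : (1u << n) - 1u;
}

static bool matches_prefix(unsigned x, unsigned len)   // len in 0..32
{
    return x == lut[len];
}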
Upvotes: 0