Display name

Reputation: 197

Test for all true bits being adjacent

I am building a Huffman prefix code LUT by traversing the tree. I am using a register to keep track of the current prefix.

My exit condition for the algorithm that parses the tree requires both of the following to be true:

  1. current element in tree must be a leaf

  2. the current prefix code must have all bits set, with no false bits.

For the second condition, I am also keeping track of the current length (in bits) of the prefix string.

How can I test that a register has more than 1 bit set and that all set bits are adjacent to each other?

EDIT: The group of set bits must start at bit 0 and be as long as the prefix length (stored in another register).

Upvotes: 2

Views: 470

Answers (2)

Peter Cordes

Reputation: 365832

The building block for this is going to be: adding 1 to the low bit of a contiguous group will clear all those bits with carry-out, leaving 1 bit set above the group. e.g. 0xff + 1 = 0x100.

If any bits are unset, carry won't propagate all the way up, e.g. 0b01101 + 1 = 0b01110, not setting bit #4. (And leaving some of the existing set bits un-flipped, so x & (x+1) will be true.)

This works for bit-groups at the bottom of a register (adding 1), or at any higher position (isolate the lowest set bit with (-x) & x and add that, e.g. with BMI1 blsi or mov/neg/and).

A related bithack is the y & (y-1) test for an integer having only a single bit set (by clearing the lowest set bit): https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2. But since we're producing y with x+1, we can optimize it down to just x & (x+1) to detect a contiguous mask at the bottom of a register.
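As a C sketch (naming is mine), the bottom-of-register version of that bithack is:

```c
#include <stdint.h>

// Nonzero if the set bits of x (possibly none) form one contiguous
// group starting at bit 0.  Adding 1 carries through the low group;
// any bit still set after the AND means a separate higher bit existed.
static inline int low_bits_contiguous(uint32_t x) {
    return (x & (x + 1)) == 0;
}
```

Note this accepts 0 and 0xFFFFFFFF as well (zero or all thirty-two contiguous low bits).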


Your specific case is very easy:

  • bottom of the bit-range must be bit 0
  • bit-range is exactly n bits wide (the prefix-length)

Those restrictions mean there's exactly 1 integer that matches both requirements, so you should pre-compute it and simply compare for equality with cmp/je. The number with n bits set at the bottom is prefix_mask = (1<<n) - 1. The borrow from the subtraction sets all the bits below that single set bit and clears the bit itself; bits above it stay unset, because that bit satisfied the borrow.

Given the prefix-length n, you can compute 1<<n using bts (which is single-uop on Intel CPUs, or 2 uops on AMD, https://agner.org/optimize/).

;; input: prefix length in EDX
;; output: prefix mask in ESI
xor    esi, esi
bts    esi, edx         ; set bit n;  esi |= 1<<n
dec    esi              ; set all the bits BELOW that power of 2

;; then later, inside a loop: input in EAX
  cmp   eax, esi
  je    all_bits_set_up_to_prefix
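The same precompute-then-compare idea as a C sketch (helper names are mine; assumes 0 < n < 32 so the shift doesn't overflow):

```c
#include <stdint.h>

// Build the mask once per prefix length: the n low bits set,
// e.g. n=3 -> 0b111.  The bts/dec sequence above computes the same value.
static inline uint32_t prefix_mask(unsigned n) {
    return (UINT32_C(1) << n) - 1;
}

// Inside the loop it's a single equality test, like the cmp/je.
static inline int matches_prefix(uint32_t x, uint32_t mask) {
    return x == mask;
}
```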

@fuz proposed a LUT for this, but that sounds like a bad idea unless you have to switch between different prefix lengths very frequently. If you're short on registers, you can spill the mask to stack memory after computing it and use something like cmp [rsp+16], eax instead of a static LUT while looping with the same prefix length.

Or spill the prefix length to memory if you don't need it in a register for a while, and just keep the mask.

You can even translate a mask back to a prefix length with lea edx, [eax+1] / bsr edx,edx to find the bit-index of the highest set bit of mask+1. (Or if a prefix-length of 32 is possible but zero isn't, then bsr / inc. BSR with input=0 leaves the destination unmodified, and sets ZF. AMD documents this, Intel's docs say "undefined" but their current HW does leave the destination unmodified, which is why the instruction has a "false" dependency on the output.)


Or without pre-computing

Non-destructive test of EDX for the low n bits being all-ones, and bit #n itself being 0. (Adding 1 clears the low bits and sets bit n if that was the case.) You can use inc edx instead of LEA (destroying the input) if you don't have a use for EDX afterwards.

;;; check the low n bits, ignoring whether higher bits are set or not
;;; inputs: prefix length in ECX, candidate in EDX
    lea  eax, [rdx+1]             
    bt   eax, ecx
;;; output: CF = 1 if all the bits from 0..len-1 were set, else 0
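In C terms (a sketch, names mine; assumes n < 32), the condition the bt extracts is just bit n of x+1:

```c
#include <stdint.h>

// 1 if bits 0..n-1 of x are all set and bit n is clear,
// ignoring any higher bits.  x+1 carries through the low group,
// so bit n of the sum is set exactly in that case (what bt tests).
static inline int low_n_all_set(uint32_t x, unsigned n) {
    return ((x + 1) >> n) & 1;
}
```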

If you also want to rule out any higher bits being set you need one more instruction, but it can be a test instruction that will macro-fuse with a jcc, so on Intel CPUs this doesn't cost any extra uops. On AMD CPUs, where btr is 2 uops vs. bt being 1, this costs 1 extra uop. (test/jcc can fuse on AMD Bulldozer-family and later.)

;;; input: prefix length in ECX, candidate in EDX
   lea     eax, [rdx+1]      ; produces a single set bit?
   btr     eax, ecx          ; reset that bit, leaving eax=0 if no other bits were set
   test    eax, eax          ; compare against zero
;;; output: ZF=1 (and eax=0) if EDX == (1<<ECX)-1 with no higher bits set.

   jz     contiguous_bitmask_of_length_ecx

This costs 3 uops total on Intel (4 on AMD), including the macro-fused test/jz, to branch on this condition. And it doesn't destroy the input register.
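In C this whole check collapses to one comparison (sketch, assuming n < 32):

```c
#include <stdint.h>

// 1 iff x == (1<<n)-1: exactly the low n bits set, nothing higher.
// Equivalently, adding 1 must produce exactly the single bit 1<<n,
// which is what the btr/test pair verifies.
static inline int is_prefix_of_length(uint32_t x, unsigned n) {
    return (x + 1) == (UINT32_C(1) << n);
}
```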


We can check for a single contiguous bit-group of unknown length at the bottom of a register with x & (x+1), which does detect if any higher bits were set. If there's a high bit that isn't flipped by carry-propagation, the AND or TEST will produce a non-zero result.

But this treats 0 and 1 the same as multi-bit groups like 0b0111. You might want cmp eax, 3 / jb not_multibit_prefix before this test.

; check for a contiguous bit-group at the bottom of a reg of arbitrary length, including 0
;;; input: candidate in EDX
   lea   eax, [rdx+1]              ; carry-out clears all the set bits at the bottom
   test  eax, edx                  ; set flags from x & (x+1)
;;; output: ZF=1 if the only set bits in EDX were a contiguous group at the bottom

I looked at a weird partial-flags hack of lea eax, [rdx+1] / test eax, edx (ZF=1: only contiguous low bits were set) / bt eax, ecx (CF=1: it ended at the position we want). But x86 doesn't have a jcc condition that requires CF=1 and ZF=1. ja is taken if (CF=0 and ZF=0), jbe is taken if (CF=1 or ZF=1), so neither works. And of course this would be horrible on CPUs without efficient partial-flag merging, causing a partial-flag stall.


The general case: bitgroup doesn't have to start at the bottom.

This rules out simple pre-computing.

As mentioned above, we can isolate the lowest set bit with (-x) & x. BMI1 added an instruction for that, blsi. So if you can assume BMI1 support, you can do this non-destructively in 1 uop. Otherwise it takes 3.

unsigned bits_contiguous(unsigned x) {
    unsigned lowest_set = (-x) & x;  // isolate lowest set bit
    unsigned add = x + lowest_set;   // carry flips all the contiguous set bits
    return (add & x) == 0;           // did add+carry leave any bits un-flipped?
}

I put this on the Godbolt compiler explorer to see if gcc or clang spotted any optimizations I didn't think of. Of course you don't actually want to materialize a 0 / 1 integer like we're asking the compiler to do, but since they choose to use test / setcc we can just look at what they do to create the right flag condition.

We can write some test functions to make sure the logic is correct for some compile-time constants with #define TEST(n) int test##n(){return bits_contiguous(n);} (and see if the asm is xor eax,eax or mov eax,1). See C + asm on the Godbolt compiler explorer. Some interesting cases are TEST(0) = 1, because the condition basically checks for there being multiple bit-groups. (So zero bit groups is the same as 1 bit-group, for this check.) TEST(0xFFFFFFFF) = 1 as well: having x+1 = 0 isn't a problem.

With gcc8.3 -O3, we get

# gcc8.3 -O3 -march=haswell  (enables BMI1 and BMI2)
bits_contiguous:
    blsi    eax, edi
    add     eax, edi
    test    eax, edi         # x & (x+lowest_set)

    sete    al
    movzx   eax, al
    ret

Without BMI1, we need 3 instructions instead of 1 for blsi:

    mov     eax, edi
    neg     eax
    and     eax, edi         # eax = isolate_lowest(x)
    add     eax, edi
    test    eax, edi

To also check for a specific length of the bit-group, @fuz had a good idea: popcnt to make sure the right number of bits were set (and separately check that they're contiguous). Popcnt is not baseline, CPUs before Nehalem don't have it and will fault if they try to run it.

;input: prefix len in ECX, candidate in EDX
;clobbers: EAX
    popcnt  eax, edx
    cmp     eax, ecx
    jne     wrong_number_of_bits_set  ; skip the contiguousness test

    blsi    eax, edx
    add     eax, edx
    test    eax, edx         ; x & (x+lowest_set)
    jz     contiguous_bitgroup_of_length_ecx

wrong_number_of_bits_set:
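The popcnt-plus-carry combination sketched in C (using GCC/Clang's __builtin_popcount as a stand-in for the popcnt instruction; function name is mine):

```c
// 1 iff x has exactly n set bits and they form one contiguous group.
// popcount handles the length; the (-x & x) carry trick (like blsi)
// handles contiguity.
static int contiguous_group_of_length(unsigned x, unsigned n) {
    if ((unsigned)__builtin_popcount(x) != n)
        return 0;                      // wrong number of bits set
    unsigned lowest_set = (-x) & x;    // isolate lowest set bit
    return ((x + lowest_set) & x) == 0;
}
```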

Upvotes: 2

fuz

Reputation: 93162

Let your number be in eax and the desired prefix length be in edx.

First of all, to get the number of trailing ones, compute the complement and then count its trailing zeroes (we branch to fail if the number does not match). The cmovz is needed as bsf doesn't like being invoked with 0 as its argument. If that case (eax == -1) can't occur, you can remove the mov ebx, 32 and the cmovz.

       mov ebx, 32      ; number of bits in an int

       mov ecx, eax
       not ecx
       bsf ecx, ecx     ; count trailing zeroes in ecx
                        ; i.e. trailing ones in eax
       cmovz ecx, ebx   ; give correct result for eax == -1
       cmp ecx, edx
       jnz fail         ; have as many ones as desired?

If your CPU has tzcnt, you can avoid the cmovz, as tzcnt is defined to return the operand size (32) for a zero input:

        mov ecx, eax
        not ecx
        tzcnt ecx, ecx
        cmp ecx, edx
        jnz fail

Note that on CPUs that do not have it, tzcnt is silently interpreted as bsf, breaking your code; watching for an undefined instruction exception is insufficient to make sure it's present.

To make sure that no other bits are set, clear out the suffix and check if the result is zero:

        lea ecx, [rax+1]
        and ecx, eax      ; ecx = eax & eax + 1
                          ; i.e. clear all trailing 1 bits
        jnz fail          ; fail if any bits are left
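Both steps translate to a short C check (a sketch; __builtin_ctz stands in for bsf/tzcnt, and the all-ones case is handled separately since ctz(0) is undefined, matching the cmovz):

```c
#include <stdint.h>

// 1 iff x consists of exactly n trailing one-bits and nothing else.
static int has_n_trailing_ones_only(uint32_t x, unsigned n) {
    // Trailing ones of x = trailing zeroes of the complement.
    unsigned ones = (~x == 0) ? 32 : (unsigned)__builtin_ctz(~x);
    if (ones != n)
        return 0;                 // wrong number of trailing ones
    return (x & (x + 1)) == 0;    // any bits left above the suffix?
}
```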

Though the fastest implementation would probably be to make a lookup table indexed by suffix length and then check if your suffix matches:

lut:    dd 00000000h, 00000001h, 00000003h, 00000007h
        dd 0000000fh, 0000001fh, 0000003fh, 0000007fh
        dd 000000ffh, 000001ffh, 000003ffh, 000007ffh
        dd 00000fffh, 00001fffh, 00003fffh, 00007fffh
        dd 0000ffffh, 0001ffffh, 0003ffffh, 0007ffffh
        dd 000fffffh, 001fffffh, 003fffffh, 007fffffh
        dd 00ffffffh, 01ffffffh, 03ffffffh, 07ffffffh
        dd 0fffffffh, 1fffffffh, 3fffffffh, 7fffffffh
        dd 0ffffffffh

        ...

        cmp [lut + edx * 4], eax
        jnz fail
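The same LUT in C (a sketch; the table holds the identical 33 masks as the dd table above, indexed by suffix length):

```c
#include <stdint.h>

// lut[n] = mask with the n low bits set, for n = 0..32.
static const uint32_t lut[33] = {
    0x00000000, 0x00000001, 0x00000003, 0x00000007,
    0x0000000f, 0x0000001f, 0x0000003f, 0x0000007f,
    0x000000ff, 0x000001ff, 0x000003ff, 0x000007ff,
    0x00000fff, 0x00001fff, 0x00003fff, 0x00007fff,
    0x0000ffff, 0x0001ffff, 0x0003ffff, 0x0007ffff,
    0x000fffff, 0x001fffff, 0x003fffff, 0x007fffff,
    0x00ffffff, 0x01ffffff, 0x03ffffff, 0x07ffffff,
    0x0fffffff, 0x1fffffff, 0x3fffffff, 0x7fffffff,
    0xffffffff
};

// 1 iff x is exactly an n-bit suffix of ones (one load + compare).
static int suffix_matches(uint32_t x, unsigned n) {
    return lut[n] == x;
}
```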

Upvotes: 0
