Reputation: 169
I have an integer of type long long that I want to convert to a new integer by removing the trailing zeros present in its binary representation.
Upvotes: 4
Views: 3399
Reputation: 144770
Here is a brute force approach:
long long remove_trailing_zeroes(long long v) {
    if (v != 0) {
        while ((v & 1) == 0)
            v /= 2;
    }
    return v;
}
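As a quick illustration of my own (not part of the original answer): 40 is 0b101000 in binary, so stripping its three trailing zero bits gives 0b101 = 5.

#include <cassert>

int main(void) {
    assert(remove_trailing_zeroes(40) == 5);    // 0b101000 -> 0b101
    assert(remove_trailing_zeroes(5) == 5);     // already odd: unchanged
    assert(remove_trailing_zeroes(-40) == -5);  // works for negative (two's complement) values too
    assert(remove_trailing_zeroes(0) == 0);     // zero stays zero
    return 0;
}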
Here is a direct approach for unsigned numbers, but the division might be more costly than the above iteration:
unsigned long long remove_trailing_zeroes(unsigned long long v) {
    if (v != 0) {
        // v and (v - 1) differ only in the trailing 0 bits plus the lowest 1 bit
        // shifting v ^ (v - 1) right by 1 and adding 1 gives the power of 2
        // by which to divide v to remove all trailing 0 bits
        v /= (((v ^ (v - 1)) >> 1) + 1);
    }
    return v;
}
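To make that comment concrete, here is a small worked example of my own with v = 88; the static_asserts are just compile-time checks of the identity:

// v           = 88 = 0b1011000
// v - 1       = 87 = 0b1010111   (trailing zeros become ones, lowest 1 bit becomes 0)
// v ^ (v - 1) = 15 = 0b0001111   (the lowest set bit plus everything below it)
// (15 >> 1)+1 =  8 = 0b0001000   (the power of two to divide by)
// 88 / 8      = 11 = 0b1011      (trailing zeros removed)
static_assert((((88u ^ (88u - 1)) >> 1) + 1) == 8u, "isolates 2^(number of trailing zeros)");
static_assert(88u / (((88u ^ (88u - 1)) >> 1) + 1) == 11u, "88 with trailing zeros removed");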
harold suggested this simplification:
unsigned long long remove_trailing_zeroes(unsigned long long v) {
    if (v != 0) {
        // -v, which is (~v + 1), has all bits above the least significant
        // 1 bit flipped, so -v & v isolates that lowest set bit.
        // Dividing v by -v & v shifts all trailing zero bits out.
        v /= -v & v;
    }
    return v;
}
This can be condensed into a single expression:
unsigned long long remove_trailing_zeroes(unsigned long long v) {
    return v ? v / (-v & v) : v;
}
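Continuing the worked example above (again mine, not from the answer): in two's complement the low bits of -88 are ...10101000, so ANDing with 88 = 0b1011000 leaves only the lowest set bit.

static_assert((-88ull & 88ull) == 8u, "-v & v isolates the lowest set bit");
static_assert(88ull / (-88ull & 88ull) == 11u, "same result as the previous version");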
To avoid the division, you could count the number of bits in v ^ (v - 1) with an efficient method and shift v right by one less than this number. This works for 0 as well, so you get branchless code.
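Here is a sketch of that branchless variant; using GCC/Clang's __builtin_popcountll as the "efficient method" is my assumption, any population count would do. For v == 0, v ^ (v - 1) is all ones, so the shift amount is 63, which is still in range and yields 0 without a branch.

unsigned long long remove_trailing_zeroes_popcount(unsigned long long v) {
    // v ^ (v - 1) has (number of trailing zeros + 1) bits set,
    // so shifting right by one less removes exactly the trailing zeros
    return v >> (__builtin_popcountll(v ^ (v - 1)) - 1);
}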
You can find other methods in the fascinating world of Bit Twiddling Hacks.
Upvotes: 5
Reputation: 1055
If you use x86, then bsf (or tzcnt with the BMI extension) can be used via, e.g., __builtin_ctzl (supported by both GCC and Clang) to find the number of trailing zeros. Removing them is then just a matter of a single shift:
unsigned long long remove_trailing_zeroes_ctz(unsigned long long v) {
    return v ? v >> __builtin_ctzl(v) : 0;
}

unsigned long long remove_trailing_zeroes_ctz_unchecked(unsigned long long v) {
    return v >> __builtin_ctzl(v);
}

int test_ub() { // see below
    return remove_trailing_zeroes_ctz(0) == remove_trailing_zeroes_ctz_unchecked(0);
}
godbolt (it also contains the last remove_trailing_zeroes approach from chqrlie's answer for comparison):
GCC:

remove_trailing_zeroes:
        mov rax, rdi
        test rdi, rdi
        je .L2
        mov rcx, rdi
        xor edx, edx
        neg rcx
        and rcx, rdi
        div rcx
.L2:
        ret
remove_trailing_zeroes_ctz:
        xor ecx, ecx
        mov rdx, rdi
        mov rax, rdi
        rep bsf rcx, rdi
        shr rdx, cl
        test rdi, rdi
        cmovne rax, rdx
        ret
remove_trailing_zeroes_ctz_unchecked:
        xor ecx, ecx
        mov rax, rdi
        rep bsf rcx, rdi
        shr rax, cl
        ret
test_ub:
        mov eax, 1
        ret
Clang:

remove_trailing_zeroes:                 # @remove_trailing_zeroes
        test rdi, rdi
        je .LBB0_1
        mov rcx, rdi
        neg rcx
        and rcx, rdi
        mov rax, rdi
        shr rax, 32
        je .LBB0_3
        mov rax, rdi
        xor edx, edx
        div rcx
        ret
.LBB0_1:
        xor eax, eax
        ret
.LBB0_3:
        mov eax, edi
        xor edx, edx
        div ecx
        ret
remove_trailing_zeroes_ctz:             # @remove_trailing_zeroes_ctz
        rep bsf rcx, rdi
        mov rax, rdi
        shr rax, cl
        test rdi, rdi
        cmove rax, rdi
        ret
remove_trailing_zeroes_ctz_unchecked:   # @remove_trailing_zeroes_ctz_unchecked
        mov rax, rdi
        rep bsf rcx, rdi
        shr rax, cl
        ret
test_ub:                                # @test_ub
        ret
We can see that on both GCC and Clang the division version contains branches (one on GCC and two on Clang, because the latter performs a peculiar optimization: for better average performance it tries to use a 32-bit div for the 64-bit division when the arguments allow it) and, obviously, the div instruction itself. The __builtin_ctzl versions have no heavyweight instruction like div, just rep bsf (see the next paragraph) and shr as expected, with only one additional conditional move in the zero-checked version.
The rep bsf instructions you can see in the assembly are bsf with a rep prefix. Due to the way this prefix (F3) and bsf (0F BC) are encoded, and the way the tzcnt opcode was chosen, rep bsf and tzcnt correspond to the same 3-byte prefix-opcode sequence F3 0F BC in machine code when used in 32-bit mode, or (as in our case) to the 4-byte F3 48 0F BC with an additional REX.W prefix (48). This sequence is interpreted as tzcnt on processors with BMI support and as plain bsf on processors without it (in the latter case the rep prefix is simply ignored, as it is not meaningful for the bsf instruction). Note that tzcnt is sometimes faster than bsf (described in some detail here) and that they differ in how they handle zero input: tzcnt sets the destination register to 32 or 64 depending on the operand size, while bsf leaves it with its old value (strictly speaking, Intel documents the destination as undefined in that case). Because of the latter difference, rep bsf can't be trusted to produce a stable result for a zero argument across processors with and without BMI, so its uses are surrounded by zero checks.

That's what the -mbmi option can usually help with: it tells the compiler to assume the code only runs on BMI-enabled processors, where the same rep bsf instruction can only be interpreted as tzcnt, and thus, if its default zero-handling behaviour (which is defined) suits us, the zero checks can be omitted. You can see the difference clearly in this example (though note that I did it a bit earlier for C++) by adding/removing the -mbmi option. That rep bsf is replaced with tzcnt in the disassembler output when -mbmi is present isn't very important, since they still share the same prefix-opcode bytes. However, for our purposes the default tzcnt behaviour wouldn't be sufficient even if we had a guarantee that __builtin_ctzl acts as tzcnt when -mbmi is present (we don't): as explained below in "About zero-checking and UB", we would run into the shift-amount UB problem, so we still need a manual check for zero, and that check certainly doesn't go away when compiling with -mbmi. The only useful thing the -mbmi option does for us, then, is to shrink the division version (remove_trailing_zeroes) by two instructions by using the blsi instruction (which essentially computes -v & v).
About zero-checking and UB

Checking for zero is needed, strictly speaking, since the return value of __builtin_ctzl is undefined for a zero argument. You may think that when v == 0 it doesn't matter what we right-shift it by, but in both C and C++ it is undefined behaviour to shift (whether left or right) a number of (promoted) type T by anything outside the range [0; N - 1], where N is the bit width of T (i.e. sizeof(T) * CHAR_BIT). You can see this UB manifest itself in how Clang treats 0 >> 32 being used as the return value of a function here (it simply doesn't touch the rax register and, thus, main returns garbage). And __builtin_ctzl(0) may theoretically return anything, though if it is implemented with tzcnt it returns the bit size of the type, which would still cause UB as a shift amount.

Practically, though, when the shift amount isn't known at compile time, I doubt compilers would do anything other than use the shr instruction in the expected way when compiling remove_trailing_zeroes_ctz_unchecked itself (see below). And shr has perfectly defined behaviour for all shift amounts (though it may seem unexpected and cause surprises like here): it uses only the lower 5 or 6 bits of the shift amount, depending on whether the 32-bit or 64-bit form of the instruction is used (i.e. the shift amount is taken mod 32 or 64 respectively). For an input of 0, then, shr by any amount results in 0, unlike div by 0, which raises a #DE exception. You can see that the assembly for remove_trailing_zeroes_ctz_unchecked is correct for both compilers and would work well if used as is. However, it can still cause issues when evaluated with a compile-time constant 0, because the compiler evaluates the (UB) C function, not the assembly it compiles to (which is correct with respect to our intention). That results in a compile-time shift by a compile-time evaluated __builtin_ctzl(0), which is either an indeterminate value or perhaps internally just 64 (matching tzcnt behaviour); either way it is easy-to-abuse UB, unlike the case where 0 is a runtime argument, where the function has to be compiled to work for the general case and the obvious/optimal assembly happens to work for 0 as well, for the reasons above. This UB manifests itself in Clang's behaviour: test_ub compiles to return garbage instead of the intended 1 that you get with GCC.
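Not from the answer, but one way to keep the single-shift form well defined for v == 0 without a branch is to OR in the top bit before counting: this never changes the position of the lowest set bit of a nonzero value, and for zero it caps the count at 63, so neither the ctz-of-zero problem nor the shift-by-64 UB can occur. A sketch using __builtin_ctzll (the unsigned long long variant of the builtin):

unsigned long long remove_trailing_zeroes_defined(unsigned long long v) {
    // for v != 0 the OR doesn't move the lowest set bit, so the count is unchanged;
    // for v == 0 the count becomes 63 and 0 >> 63 is a well-defined 0
    return v >> __builtin_ctzll(v | 0x8000000000000000ull);
}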
Summing up: if you can use intrinsics, I would recommend remove_trailing_zeroes_ctz_unchecked if you know your input can't be zero (e.g. this can be ensured as part of some prior input validation). If it can be zero, especially if the function may be evaluated at compile time (which is not always obvious, considering all the constant propagation that may end up producing a compile-time constant argument where it didn't look like one at first glance) and you are using Clang, then better use the remove_trailing_zeroes_ctz version.
For C++ you could use std::countr_zero (doc), available since C++20, which behaves like tzcnt (including for a zero argument) and lets you do the above nicely and portably (the same considerations about shifting, checking for 0 and UB apply, except that at least the result of countr_zero(0) itself is defined to be the bit width of the argument type). C unfortunately has no equally convenient standard function as of now (C23 adds stdc_trailing_zeros in <stdbit.h>):
#include <bit>

unsigned long long remove_trailing_zeros_std(unsigned long long v) {
    return v ? v >> std::countr_zero(v) : 0;
}

unsigned long long remove_trailing_zeros_std_unchecked(unsigned long long v) {
    return v >> std::countr_zero(v);
}
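A minimal usage sketch of my own for these two (compile with -std=c++20 or later):

#include <cassert>

int main() {
    assert(remove_trailing_zeros_std(88) == 11);            // 0b1011000 -> 0b1011
    assert(remove_trailing_zeros_std(0) == 0);               // the guarded version handles zero
    assert(remove_trailing_zeros_std_unchecked(88) == 11);   // only safe for nonzero input
}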
Upvotes: 0