Reputation: 169
I have an integer of type long long that I want to convert to a new integer by removing the trailing zeros present in its binary representation.
Upvotes: 4
Views: 3399
Reputation: 144770
Here is a brute force approach:
long long remove_trailing_zeroes(long long v) {
    if (v != 0) {
        while ((v & 1) == 0)
            v /= 2;
    }
    return v;
}
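As a quick illustration of my own (not part of the original answer): 40 is 0b101000 in binary, so stripping its three trailing zero bits gives 0b101 = 5.

#include <cassert>

int main(void) {
    assert(remove_trailing_zeroes(40) == 5);    // 0b101000 -> 0b101
    assert(remove_trailing_zeroes(5) == 5);     // already odd: unchanged
    assert(remove_trailing_zeroes(-40) == -5);  // works for negative (two's complement) values too
    assert(remove_trailing_zeroes(0) == 0);     // zero stays zero
    return 0;
}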
Here is a direct approach for unsigned numbers, but the division might be more costly than the above iteration:
unsigned long long remove_trailing_zeroes(unsigned long long v) {
    if (v != 0) {
        // v and (v - 1) differ only in the trailing 0 bits plus the lowest 1 bit
        // shifting v ^ (v - 1) right by 1 and adding 1 gives the power of 2
        // by which to divide v to remove all trailing 0 bits
        v /= (((v ^ (v - 1)) >> 1) + 1);
    }
    return v;
}
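To make that comment concrete, here is a small worked example of my own with v = 88; the static_asserts are just compile-time checks of the identity:

// v           = 88 = 0b1011000
// v - 1       = 87 = 0b1010111   (trailing zeros become ones, lowest 1 bit becomes 0)
// v ^ (v - 1) = 15 = 0b0001111   (the lowest set bit plus everything below it)
// (15 >> 1)+1 =  8 = 0b0001000   (the power of two to divide by)
// 88 / 8      = 11 = 0b1011      (trailing zeros removed)
static_assert((((88u ^ (88u - 1)) >> 1) + 1) == 8u, "isolates 2^(number of trailing zeros)");
static_assert(88u / (((88u ^ (88u - 1)) >> 1) + 1) == 11u, "88 with trailing zeros removed");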
harold suggested this simplification:
unsigned long long remove_trailing_zeroes(unsigned long long v) {
    if (v != 0) {
        // -v, which is (~v + 1), has all bits above the least significant
        // 1 bit flipped, so -v & v isolates that lowest set bit.
        // Dividing v by -v & v shifts all trailing zero bits out.
        v /= -v & v;
    }
    return v;
}
This can be condensed into a single expression:
unsigned long long remove_trailing_zeroes(unsigned long long v) {
    return v ? v / (-v & v) : v;
}
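Continuing the worked example above (again mine, not from the answer): in two's complement the low bits of -88 are ...10101000, so ANDing with 88 = 0b1011000 leaves only the lowest set bit.

static_assert((-88ull & 88ull) == 8u, "-v & v isolates the lowest set bit");
static_assert(88ull / (-88ull & 88ull) == 11u, "same result as the previous version");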
To avoid the division, you could count the number of bits in v ^ (v - 1) with an efficient method and shift v right by one less than this number. This works for 0 as well, so you get branchless code.
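Here is a sketch of that branchless variant; using GCC/Clang's __builtin_popcountll as the "efficient method" is my assumption, any population count would do. For v == 0, v ^ (v - 1) is all ones, so the shift amount is 63, which is still in range and yields 0 without a branch.

unsigned long long remove_trailing_zeroes_popcount(unsigned long long v) {
    // v ^ (v - 1) has (number of trailing zeros + 1) bits set,
    // so shifting right by one less removes exactly the trailing zeros
    return v >> (__builtin_popcountll(v ^ (v - 1)) - 1);
}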
You can find other methods in the fascinating world of Bit Twiddling Hacks.
Upvotes: 5
Reputation: 1055
If you use x86, then bsf (or tzcnt with the BMI extension) can be used via, e.g., __builtin_ctzl (supported by both GCC and Clang) to find the number of trailing zeros. Removing them is then just a matter of a single shift:
unsigned long long remove_trailing_zeroes_ctz(unsigned long long v) {
    return v ? v >> __builtin_ctzl(v) : 0;
}

unsigned long long remove_trailing_zeroes_ctz_unchecked(unsigned long long v) {
    return v >> __builtin_ctzl(v);
}

int test_ub() { // see below
    return remove_trailing_zeroes_ctz(0) == remove_trailing_zeroes_ctz_unchecked(0);
}
godbolt (it also contains the last remove_trailing_zeroes approach from chqrlie's answer for comparison):
GCC:

remove_trailing_zeroes:
        mov rax, rdi
        test rdi, rdi
        je .L2
        mov rcx, rdi
        xor edx, edx
        neg rcx
        and rcx, rdi
        div rcx
.L2:
        ret
remove_trailing_zeroes_ctz:
        xor ecx, ecx
        mov rdx, rdi
        mov rax, rdi
        rep bsf rcx, rdi
        shr rdx, cl
        test rdi, rdi
        cmovne rax, rdx
        ret
remove_trailing_zeroes_ctz_unchecked:
        xor ecx, ecx
        mov rax, rdi
        rep bsf rcx, rdi
        shr rax, cl
        ret
test_ub:
        mov eax, 1
        ret
Clang:

remove_trailing_zeroes:                 # @remove_trailing_zeroes
        test rdi, rdi
        je .LBB0_1
        mov rcx, rdi
        neg rcx
        and rcx, rdi
        mov rax, rdi
        shr rax, 32
        je .LBB0_3
        mov rax, rdi
        xor edx, edx
        div rcx
        ret
.LBB0_1:
        xor eax, eax
        ret
.LBB0_3:
        mov eax, edi
        xor edx, edx
        div ecx
        ret
remove_trailing_zeroes_ctz:             # @remove_trailing_zeroes_ctz
        rep bsf rcx, rdi
        mov rax, rdi
        shr rax, cl
        test rdi, rdi
        cmove rax, rdi
        ret
remove_trailing_zeroes_ctz_unchecked:   # @remove_trailing_zeroes_ctz_unchecked
        mov rax, rdi
        rep bsf rcx, rdi
        shr rax, cl
        ret
test_ub:                                # @test_ub
        ret
We can see that on both GCC and Clang the division version contains branches (one on GCC and two on Clang, because the latter performs a peculiar optimization: for better average performance it tries to use a 32-bit div for the 64-bit division when the arguments allow it) and, obviously, the div instruction itself. The __builtin_ctzl versions have no heavyweight instruction like div, just rep bsf (see the next paragraph) and shr as expected, with only one additional conditional move in the zero-checked version.
The rep bsf instructions you can see in the assembly are bsf with a rep prefix. Due to the way this prefix (F3) and bsf (0F BC) are encoded, and the way the tzcnt opcode was chosen, rep bsf and tzcnt correspond to the same 3-byte prefix-opcode sequence F3 0F BC in machine code when used in 32-bit mode, or (as in our case) to the 4-byte F3 48 0F BC with an additional REX.W prefix (48). This sequence is interpreted as tzcnt on processors with BMI support and as plain bsf on processors without it (in the latter case the rep prefix is simply ignored, as it is not meaningful for the bsf instruction). Note that tzcnt is sometimes faster than bsf (described in some detail here) and that they differ in how they handle zero input: tzcnt sets the destination register to 32 or 64 depending on the operand size, while bsf leaves it with its old value (strictly speaking, Intel documents the destination as undefined in that case). Because of the latter difference, rep bsf can't be trusted to produce a stable result for a zero argument across processors with and without BMI, so its uses are surrounded by zero checks.

That's what the -mbmi option can usually help with: it tells the compiler to assume the code only runs on BMI-enabled processors, where the same rep bsf instruction can only be interpreted as tzcnt, and thus, if its default zero-handling behaviour (which is defined) suits us, the zero checks can be omitted. You can see the difference clearly in this example (though note that I did it a bit earlier for C++) by adding/removing the -mbmi option. That rep bsf is replaced with tzcnt in the disassembler output when -mbmi is present isn't very important, since they still share the same prefix-opcode bytes. However, for our purposes the default tzcnt behaviour wouldn't be sufficient even if we had a guarantee that __builtin_ctzl acts as tzcnt when -mbmi is present (we don't): as explained below in "About zero-checking and UB", we would run into the shift-amount UB problem, so we still need a manual check for zero, and that check certainly doesn't go away when compiling with -mbmi. The only useful thing the -mbmi option does for us, then, is to shrink the division version (remove_trailing_zeroes) by two instructions by using the blsi instruction (which essentially computes -v & v).
About zero-checking and UB

Checking for zero is needed, strictly speaking, since the return value of __builtin_ctzl is undefined for a zero argument. You may think that when v == 0 it doesn't matter what we right-shift it by, but in both C and C++ it is undefined behaviour to shift (whether left or right) a number of (promoted) type T by anything outside the range [0; N - 1], where N is the bit width of T (i.e. sizeof(T) * CHAR_BIT). You can see this UB manifest itself in how Clang treats 0 >> 32 being used as the return value of a function here (it simply doesn't touch the rax register and, thus, main returns garbage). And __builtin_ctzl(0) may theoretically return anything, though if it is implemented with tzcnt it returns the bit size of the type, which would still cause UB as a shift amount.

Practically, though, when the shift amount isn't known at compile time, I doubt compilers would do anything other than use the shr instruction in the expected way when compiling remove_trailing_zeroes_ctz_unchecked itself (see below). And shr has perfectly defined behaviour for all shift amounts (though it may seem unexpected and cause surprises like here): it uses only the lower 5 or 6 bits of the shift amount, depending on whether the 32-bit or 64-bit form of the instruction is used (i.e. the shift amount is taken mod 32 or 64 respectively). For an input of 0, then, shr by any amount results in 0, unlike div by 0, which raises a #DE exception. You can see that the assembly for remove_trailing_zeroes_ctz_unchecked is correct for both compilers and would work well if used as is. However, it can still cause issues when evaluated with a compile-time constant 0, because the compiler evaluates the (UB) C function, not the assembly it compiles to (which is correct with respect to our intention). That results in a compile-time shift by a compile-time evaluated __builtin_ctzl(0), which is either an indeterminate value or perhaps internally just 64 (matching tzcnt behaviour); either way it is easy-to-abuse UB, unlike the case where 0 is a runtime argument, where the function has to be compiled to work for the general case and the obvious/optimal assembly happens to work for 0 as well, for the reasons above. This UB manifests itself in Clang's behaviour: test_ub compiles to return garbage instead of the intended 1 that you get with GCC.
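Not from the answer, but one way to keep the single-shift form well defined for v == 0 without a branch is to OR in the top bit before counting: this never changes the position of the lowest set bit of a nonzero value, and for zero it caps the count at 63, so neither the ctz-of-zero problem nor the shift-by-64 UB can occur. A sketch using __builtin_ctzll (the unsigned long long variant of the builtin):

unsigned long long remove_trailing_zeroes_defined(unsigned long long v) {
    // for v != 0 the OR doesn't move the lowest set bit, so the count is unchanged;
    // for v == 0 the count becomes 63 and 0 >> 63 is a well-defined 0
    return v >> __builtin_ctzll(v | 0x8000000000000000ull);
}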
Summing up: if you can use intrinsics, I would recommend remove_trailing_zeroes_ctz_unchecked if you know your input can't be zero (e.g. this can be ensured as part of some prior input validation). If it can be zero, especially if the function may be evaluated at compile time (which is not always obvious, considering all the constant propagation that may end up producing a compile-time constant argument where it didn't look like one at first glance) and you are using Clang, then better use the remove_trailing_zeroes_ctz version.
For C++ you could use std::countr_zero (doc), available since C++20, which behaves like tzcnt (including for a zero argument) and lets you do the above nicely and portably (the same considerations about shifting, checking for 0 and UB apply, except that at least the result of countr_zero(0) itself is defined to be the bit width of the argument type). C unfortunately has no equally convenient standard function as of now (C23 adds stdc_trailing_zeros in <stdbit.h>):
#include <bit>

unsigned long long remove_trailing_zeros_std(unsigned long long v) {
    return v ? v >> std::countr_zero(v) : 0;
}

unsigned long long remove_trailing_zeros_std_unchecked(unsigned long long v) {
    return v >> std::countr_zero(v);
}
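A minimal usage sketch of my own for these two (compile with -std=c++20 or later):

#include <cassert>

int main() {
    assert(remove_trailing_zeros_std(88) == 11);            // 0b1011000 -> 0b1011
    assert(remove_trailing_zeros_std(0) == 0);               // the guarded version handles zero
    assert(remove_trailing_zeros_std_unchecked(88) == 11);   // only safe for nonzero input
}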
Upvotes: 0