Reputation: 8540
Is there any clever bit trick to detect if any of a small number of integers (say 3 or 4) has a specific value?
The straightforward version is:
bool test(int a, int b, int c, int d)
{
    // The compiler will pretty likely optimize it to (a == d | b == d | c == d)
    return (a == d || b == d || c == d);
}
It compiles to:
test(int, int, int, int):
cmp ecx, esi
sete al
cmp ecx, edx
sete dl
or eax, edx
cmp edi, ecx
sete dl
or eax, edx
ret
Those sete instructions have higher latency than I want to tolerate, so I would rather use something bitwise (&, |, ^, ~) and a single comparison.
Upvotes: 16
Views: 618
Reputation: 28837
It is not a full bit trick: XOR each input with d and multiply the results. Any match gives a zero factor, which makes the whole product zero, and negating 0 yields 1. It does not deal with overflow.
bool test(int a, int b, int c, int d)
{
    return !((a^d)*(b^d)*(c^d));
}
gcc 7.1 -O3 output (d is in ecx, the other inputs start in other integer regs):
xor edi, ecx
xor esi, ecx
xor edx, ecx
imul edi, esi
imul edx, edi
test edx, edx
sete al
ret
It might be faster than the original on Core2 or Nehalem where partial-register stalls are a problem. imul r32,r32 has 3c latency on Core2/Nehalem (and later Intel CPUs), and 1 per clock throughput, so this sequence has 7 cycle latency from the inputs to the 2nd imul result, and another 2 cycles of latency for test/sete. Throughput should be fairly good if this sequence runs on multiple independent inputs.
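For illustration, here is a minimal throughput sketch (a loop I'm adding here, not from the original answer) where each iteration's xor/imul chain is independent of the others, so the 3-cycle multiplies from different iterations can overlap in the pipeline:
#include <stdbool.h>

// Hypothetical helper: test many independent (a, b, c) triples against the same d.
// No iteration's result feeds into the next, so their multiply chains can pipeline.
void test_many(const int *a, const int *b, const int *c, int d, bool *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = !((a[i] ^ d) * (b[i] ^ d) * (c[i] ^ d));
}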
Using a 64-bit multiply would avoid the overflow problem on the first multiply, but the second could still overflow if the total is >= 2**64. It would still be the same performance on Intel Nehalem and Sandybridge-family, and AMD Ryzen, but it would be slower on older CPUs.
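A sketch of that 64-bit-multiply idea (my addition, assuming 32-bit int; the first product is exact, only the second can wrap):
#include <stdbool.h>

// Widen the first multiply to 64 bits so it cannot overflow; the second multiply
// can still wrap, but only if the full product would be >= 2**64.
bool test_mul64(int a, int b, int c, int d)
{
    unsigned long long p = (unsigned long long)(unsigned)(a ^ d) * (unsigned)(b ^ d);
    return !(p * (unsigned)(c ^ d));
}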
In x86 asm, doing the second multiply with a full-multiply one-operand mul instruction (64x64b => 128b) would avoid overflow, and the result could be checked for being all-zero or not with or rax,rdx. We can write that in GNU C for 64-bit targets (where __int128 is available):
bool test_mulwide(unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned __int128 mul1 = (a^d)*(unsigned long long)(b^d);
    return !(mul1*(c^d));
}
and gcc/clang really do emit the asm we hoped for (each with some useless mov instructions):
# gcc -O3 for x86-64 SysV ABI
mov eax, esi
xor edi, ecx
xor eax, ecx
xor ecx, edx # zero-extends
imul rax, rdi
mul rcx # 64 bit inputs (rax implicit), 128b output in rdx:rax
mov rsi, rax # this is useless
or rsi, rdx
sete al
ret
This should be almost as fast as the simple version that can overflow, on modern x86-64. (mul r64 is still only 3c latency, but 2 uops instead of 1 for an imul r64,r64 that doesn't produce the high half, on Intel Sandybridge-family.)
It's still probably worse than clang's setcc/or output from the original version, which uses 8-bit or instructions to avoid reading 32-bit registers after writing the low byte (i.e. no partial-register stalls).
See both sources with both compilers on the Godbolt compiler explorer. (Also included: @BeeOnRope's ^ / & version that risks false positives, with and without a fallback to a full check.)
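For reference, here is one way such a ^ / & filter with a fallback could look (my own sketch of the idea, not @BeeOnRope's actual code):
#include <stdbool.h>

// If any of a, b, c equals d, its XOR with d is zero, so the AND of the three
// XORs is zero. The AND can also be zero without a match (a false positive),
// so a zero result falls back to the exact check; a nonzero result proves no match.
bool test_xor_and(int a, int b, int c, int d)
{
    if (((a ^ d) & (b ^ d) & (c ^ d)) != 0)
        return false;                       // fast path: definitely no match
    return a == d || b == d || c == d;      // slow path: confirm or reject
}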
Upvotes: 3
Reputation: 24146
The only solution I've found so far is:
int s1 = ((a-d) >> 31) | ((d-a) >> 31);  // 0 if a == d, all-ones otherwise (barring overflow)
int s2 = ((b-d) >> 31) | ((d-b) >> 31);  // 0 if b == d, all-ones otherwise
int s3 = ((c-d) >> 31) | ((d-c) >> 31);  // 0 if c == d, all-ones otherwise
int s = s1 & s2 & s3;                    // low bit clear if any matched
return (s & 1) == 0;
An alternative variant skips the shifts and tests the sign bit instead:
int s1 = (a-d) | (d-a);  // 0 if a == d, sign bit set otherwise (barring overflow)
int s2 = (b-d) | (d-b);
int s3 = (c-d) | (d-c);
int s = (s1 & s2 & s3);  // sign bit clear if any matched
return (s & 0x80000000) == 0;
both are translated to:
mov eax, ecx
sub eax, edi
sub edi, ecx
or edi, eax
mov eax, ecx
sub eax, esi
sub esi, ecx
or esi, eax
and esi, edi
mov eax, edx
sub eax, ecx
sub ecx, edx
or ecx, eax
test esi, ecx
setns al
ret
which has fewer sete instructions, but obviously more mov/sub instructions.
Update: as @BeeOnRope suggested, it makes sense to cast the input variables to unsigned (so the subtractions wrap instead of causing signed-overflow undefined behavior).
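A sketch of the second variant with that cast applied (my addition, not part of the original answer; the wrapping unsigned subtraction keeps the sign-bit property without UB):
#include <stdbool.h>

// With unsigned operands the subtractions wrap: if a != d, at least one of
// ua-ud and ud-ua is >= 0x80000000, so the top bit of s1 is set; if a == d,
// both are zero. Same reasoning for b and c.
bool test_signbit(int a, int b, int c, int d)
{
    unsigned ua = a, ub = b, uc = c, ud = d;
    unsigned s1 = (ua - ud) | (ud - ua);
    unsigned s2 = (ub - ud) | (ud - ub);
    unsigned s3 = (uc - ud) | (ud - uc);
    return ((s1 & s2 & s3) & 0x80000000u) == 0;
}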
Upvotes: 5