plasmacel
plasmacel

Reputation: 8540

Bit trick to detect if any of some integers has a specific value

Is there any clever bit trick to detect if any of a small number of integers (say 3 or 4) has a specific value?

The straightforward

bool test(int a, int b, int c, int d)
{
    // The compiler will pretty likely optimize it to (a == d | b == d | c == d)
    return (a == d || b == d || c == d);
}

in GCC compiles to

test(int, int, int, int):
        cmp     ecx, esi
        sete    al
        cmp     ecx, edx
        sete    dl
        or      eax, edx
        cmp     edi, ecx
        sete    dl
        or      eax, edx
        ret

Those sete instructions have higher latency than I want to tolerate, so I would rather use something bitwise (&, |, ^, ~) stuff and a single comparison.

Upvotes: 16

Views: 618

Answers (2)

EvilTeach
EvilTeach

Reputation: 28837

It is not a full bit trick. Any zero yields a zero product, which gives a zero result. Negate 0 yields a 1. Does not deal with overflow.

bool test(int a, int b, int c, int d)
{
    return !((a^d)*(b^d)*(c^d));
}

gcc 7.1 -O3 output. (d is in ecx, the other inputs start in other integer regs).

    xor     edi, ecx
    xor     esi, ecx
    xor     edx, ecx
    imul    edi, esi
    imul    edx, edi
    test    edx, edx
    sete    al
    ret

It might be faster than the original on Core2 or Nehalem where partial-register stalls are a problem. imul r32,r32 has 3c latency on Core2/Nehalem (and later Intel CPUs), and 1 per clock throughput, so this sequence has 7 cycle latency from the inputs to the 2nd imul result, and another 2 cycles of latency for test/sete. Throughput should be fairly good if this sequence runs on multiple independent inputs.

Using a 64-bit multiply would avoid the overflow problem on the first multiply, but the second could still overflow if the total is >= 2**64. It would still be the same performance on Intel Nehalem and Sandybridge-family, and AMD Ryzen. But it would be slower on older CPUs.

In x86 asm, doing the second multiply with a full-multiply one-operand mul instruction (64x64b => 128b) would avoid overflow, and the result could be checked for being all-zero or not with or rax,rdx. We can write that in GNU C for 64-bit targets (where __int128 is available)

bool test_mulwide(unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned __int128 mul1 = (a^d)*(unsigned long long)(b^d);
    return !(mul1*(c^d));
}

and gcc/clang really do emit the asm we hoped for (each with some useless mov instructions):

   # gcc -O3 for x86-64 SysV ABI
    mov     eax, esi
    xor     edi, ecx
    xor     eax, ecx
    xor     ecx, edx   # zero-extends
    imul    rax, rdi
    mul     rcx        # 64 bit inputs (rax implicit), 128b output in rdx:rax
    mov     rsi, rax   # this is useless
    or      rsi, rdx
    sete    al
    ret

This should be almost as fast as the simple version that can overflow, on modern x86-64. (mul r64 is still only 3c latency, but 2 uops instead of 1 for imul r64,r64 that doesn't produce the high-half), on Intel Sandybridge-family.)


It's still probably worse than clang's setcc/or output from the original version, which uses 8-bit or instructions to avoid reading 32-bit registers after writing the low byte (i.e. no partial-register stalls).

See both sources with both compilers on the Godbolt compiler explorer. (Also included: @BeeOnRope's ^ / & version that risks false positives, with and without a fallback to a full check.)

Upvotes: 3

Iłya Bursov
Iłya Bursov

Reputation: 24146

The only solution I've found yet is:

int s1 = ((a-d) >> 31) | ((d-a) >> 31);
int s2 = ((b-d) >> 31) | ((d-b) >> 31);
int s3 = ((c-d) >> 31) | ((d-c) >> 31);

int s = s1 & s2 & s3;
return (s & 1) == 0;

alternative variant:

int s1 = (a-d) | (d-a);
int s2 = (b-d) | (d-b);
int s3 = (c-d) | (d-c);

int s = (s1 & s2 & s3);
return (s & 0x80000000) == 0;

both are translated to:

mov     eax, ecx
sub     eax, edi
sub     edi, ecx
or      edi, eax
mov     eax, ecx
sub     eax, esi
sub     esi, ecx
or      esi, eax
and     esi, edi
mov     eax, edx
sub     eax, ecx
sub     ecx, edx
or      ecx, eax
test    esi, ecx
setns   al
ret

which has less sete instructions, but obviously more mov/sub.

Update: as BeeOnRope@ suggested - it makes sense to cast input variables to unsigned

Upvotes: 5

Related Questions