Reputation: 30615
I have a struct like this:
struct {
uint32_t a;
uint16_t b;
uint16_t c;
uint16_t d;
uint8_t e;
} s;
and I would like to compare two of the above structs for equality, in the fastest way possible. I looked at the Intel Intrinsics Guide but couldn't find a compare for integers, the options available were mainly doubles and single-floating point vector-inputs.
Could somebody please advise the best approach? I can add a union to my struct to make processing easier.
I am limited (for now) to using SSE4.2, but any AVX answers would be welcome too if they are significantly faster. I am using GCC 4.8.2
Upvotes: 3
Views: 1157
Reputation: 364220
What @zx485 should have written is:
.data
mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
pxor xmm1, xmm2 ; equiv to psubb, but runs on all 3 vector execution ports
ptest xmm1, xmmword ptr [mask11byte] ; SSE 4.1
setz al ; AL=TRUE for equal
As long as nothing bad happens (floating point exceptions), you don't need to mask off your operands before computation, even if they hold garbage. And since PTEST
does a bitwise AND as part of its operation, you don't need a separate PAND
at all.
For a while, I thought I had a version that could use less space and fewer uops, but I ended up needing an extra instruction because there's no pcmpneq
(so I needed a logical not
). So it's smaller, the same number of uops, but significantly worse latency.
.code
PCMPEQB xmm1, xmm2 ; bytes of xmm1 = 0xFF on equal
PMOVMSKB eax, xmm1 ; ax = high bit of each byte of xmm1
NOT eax
TEST eax, 0x7FF ; zero flag set if all the low 11 bits are zero
SETZ al ; 17 bytes
; Or one fewer insn with BMI1's ANDN. One fewer uop if test can't macro-fuse
ANDN eax, eax, [mask11bits] ; only test the low 11 bits.
; ANDN version takes 20 bytes, plus 2B of data
.data
mask11bits dw 07ffh
test
can macro-fuse with jcc
, so if you're using this as a jump condition instead of actually doing setz
, you come out ahead on size. (since you don't need the 16B mask constant.)
ptest
takes 2 uops, so the ptest
version is 4 uops total (including the jcc
or other instruction). The pmovmskb
version is also 4 uops with a test
/jcc
macro-fused branch, but 5 with cmovcc
/ setcc
. (4 with andn
, with any of setcc
/ cmovcc
/ jcc
since it can't macro-fuse`.)
(Agner Fog's table says ptest
takes 1 fused-domain uop on Sandybridge, 2 on all other Intel CPUs that support it. I'm not sure I believe that, though.)
Latency on Haswell (important if the branch doesn't predict well):
pxor
: 1 + ptest
: 2 = 3 cyclespcmpeqb
: 1 + pmovmskb
: 3 + not
: 1 + test
: 1 = 6 cyclespcmpeqb
: 1 + pmovmskb
: 3 + andn
: 1 = 5 cycles (but not macro-fused, so possibly 1 more cycle of latency?)So the ptest
version has significantly shorter latency: jcc
can execute sooner, to detect branch mispredicts sooner.
Agner Fog's tests show ptest
has latency = 3 on Nehalem, 1 on SnB/IvB, 2 on Haswell.
Upvotes: 2
Reputation: 29022
A simple solution would be to just subtract the two structs byte wise after masking so you get an all-zero-value only if all packed bytes are identical. This code is in MASM format, but you surely can adapt that to gcc AT&T syntax or intrinsicals:
.data
mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
pand xmm1, xmmword ptr [mask11byte]
pand xmm2, xmmword ptr [mask11byte]
psubb xmm1, xmm2
ptest xmm1, xmm1 ; SSE 4.1
setz al ; AL=TRUE for equal
Addition: Because the size of the struct is 11 byte, 256bit/32byte-AVX(x) would make no sense.
Upvotes: 0