Reputation: 13749
Can FP compares like SSE2 _mm_cmpeq_pd / AVX _mm_cmp_pd be used to compare 64 bit integers?
The idea is to emulate the missing _mm_cmpeq_epi64 that would be similar to _mm_cmpeq_epi8, _mm_cmpeq_epi16, _mm_cmpeq_epi32.
The concern is that I'm not sure whether the comparison is bitwise or follows floating-point semantics, for example NaN values always comparing unequal.
Upvotes: 3
Views: 329
Reputation: 365757
AVX implies that SSE4.1 pcmpeqq is available; in that case you should just use _mm_cmpeq_epi64.
FP compares treat NaN != NaN, and -0.0 == +0.0, and if DAZ is set in MXCSR, treat any small integer as zero. (Because exponent = 0 means it represents a denormal, and Denormals-Are-Zero mode treats them as exactly zero on input to avoid possible speed penalties for any operations on any microarchitecture, including for compares. IIRC, modern microarchitectures don't have a penalty for subnormal inputs to compares, but do still for some other operations. In any case, programs built with -ffast-math set FTZ and DAZ for the main thread on startup.)
So FP compares are not really usable for integers unless you know that some but not all of bits [62:52] (inclusive) will be set.
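To make the NaN and signed-zero cases concrete, here is a minimal sketch that feeds integer bit-patterns to cmpeqpd. The helper name fp_eq_mask is made up for illustration; it only uses SSE2 intrinsics, and the DAZ case is left out because it depends on the current MXCSR state.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: broadcast two 64-bit integers, compare them with
   cmpeqpd, and return the 2-bit movmskpd result (3 = "equal" in both lanes,
   0 = "not equal" in both lanes). */
static int fp_eq_mask(int64_t a, int64_t b)
{
    __m128d va = _mm_castsi128_pd(_mm_set1_epi64x(a));
    __m128d vb = _mm_castsi128_pd(_mm_set1_epi64x(b));
    return _mm_movemask_pd(_mm_cmpeq_pd(va, vb));
}
```

With this, fp_eq_mask(0x7FF8000000000000, 0x7FF8000000000000) reports "not equal" because the bit-pattern is a NaN, while fp_eq_mask(INT64_MIN, 0) reports "equal" because the inputs are -0.0 and +0.0.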
It's much easier to use pcmpeqd (_mm_cmpeq_epi32) than to hack up some FP bit-manipulation. (Although @chtz suggested in comments you could do 42.0 == (42.0 ^ (a^b)) with xorpd, as long as the compiler doesn't optimize away the constant and compare against 0.0. That's a GCC bug without -ffast-math).
If you want a condition like at-least-one-match then you need to make sure both halves of a 64-bit element matched, like mask & (mask<<1) on a movmskps result, which can compile to lea / test. Only bits 1 and 3 of that AND are meaningful (so check it against 0b1010), because bit 2 combines halves of two different elements. (You could mask & (mask<<4) on a pmovmskb result, but that's slightly less efficient because LEA copy-and-shift can only shift by 0..3.)
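A minimal sketch of the at-least-one-match check with plain SSE2 could look like this. The function name any_eq_epi64_sse2 is an assumption for illustration. Note the extra & 0xA: bit 2 of mask & (mask<<1) pairs the low half of element 1 with the high half of element 0, so it has to be ignored.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: nonzero if at least one 64-bit element of a equals
   the corresponding element of b. */
static int any_eq_epi64_sse2(__m128i a, __m128i b)
{
    __m128i eq32 = _mm_cmpeq_epi32(a, b);  /* per-dword compare */
    /* movmskps: one bit per 32-bit element, 4 bits total */
    unsigned mask = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(eq32));
    /* bit 1 = both halves of element 0 matched,
       bit 3 = both halves of element 1 matched */
    return (mask & (mask << 1) & 0xAu) != 0;
}
```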
Of course "all-matched" doesn't care about element sizes so you can just use _mm_movemask_epi8 on any compare result, and check it against 0xFFFF.
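For example, the all-matched case could be written as follows (all_eq_sse2 is a made-up name; any of the SSE2 integer equality compares would work in place of _mm_cmpeq_epi32):

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: nonzero if all 128 bits of a and b are equal.
   pmovmskb yields 16 bits, one per byte of the compare result. */
static int all_eq_sse2(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}
```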
If you want to use it for a blend with and/andnot/or, you can pshufd / pand to swap halves within 64-bit elements. (If you were feeding pblendvb or blendvpd, that would mean SSE4.1 was available so you should have used pcmpeqq.)
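Put together, an SSE2-only stand-in for _mm_cmpeq_epi64 might look like the sketch below. The names cmpeq_epi64_sse2 and select_sse2 are assumptions for illustration, not existing intrinsics.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: all-ones in each 64-bit element where a == b,
   all-zero otherwise. */
static __m128i cmpeq_epi64_sse2(__m128i a, __m128i b)
{
    __m128i eq32 = _mm_cmpeq_epi32(a, b);
    /* pshufd: swap the two dwords inside each 64-bit element */
    __m128i swapped = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));
    /* pand: an element is equal only if both of its halves matched */
    return _mm_and_si128(eq32, swapped);
}

/* Hypothetical helper: and/andnot/or blend, picking x where mask is set
   and y where it is clear. */
static __m128i select_sse2(__m128i mask, __m128i x, __m128i y)
{
    return _mm_or_si128(_mm_and_si128(mask, x), _mm_andnot_si128(mask, y));
}
```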
The more expensive one to emulate is SSE4.2 pcmpgtq, although I think GCC and/or clang do know how to emulate it when auto-vectorizing.
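For reference, one way to emulate a signed 64-bit greater-than with only SSE2 is sketched below. This is not necessarily the sequence GCC or clang emit, and cmpgt_epi64_sse2 is a made-up name. The idea: if the high dwords differ, a signed compare of the high dwords decides; if they are equal, the borrow out of the low dwords in b - a decides.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: all-ones in each 64-bit element where a > b (signed). */
static __m128i cmpgt_epi64_sse2(__m128i a, __m128i b)
{
    /* High dword of (b - a) is all-ones when the high dwords are equal
       and low(a) > low(b) as unsigned; keep it only in that case. */
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    /* Otherwise the signed compare of the high dwords decides. */
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    /* Broadcast each element's high dword to both of its halves. */
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
}
```

That is five instructions instead of one, which gives a feel for why this one is the more expensive to emulate.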
Upvotes: 3