Reputation: 3284
I was looking for a SIMD option to speed up comparisons and I found the function `__m128d _mm_cmpgt_sd (__m128d a, __m128d b)`. Apparently it compares the lower double, and copies the higher double from `a` into the output. What it is doing makes sense, but what's the point? What problem is this trying to solve?
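For reference, here is a minimal sketch of the behaviour as I understand it (the values are arbitrary, chosen only for illustration):

```c
#include <emmintrin.h>  // SSE2
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(7.5, 2.0);   // high = 7.5, low = 2.0
    __m128d b = _mm_set_pd(1.0, 3.0);   // high = 1.0, low = 3.0

    // Low lane: 2.0 > 3.0 is false -> all-zero bits (reads back as 0.0).
    // High lane: copied unchanged from a (7.5), not compared against b.
    __m128d r = _mm_cmpgt_sd(a, b);

    double out[2];
    _mm_storeu_pd(out, r);
    printf("low = %f, high = %f\n", out[0], out[1]);  // low = 0.000000, high = 7.500000
    return 0;
}
```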
Upvotes: 3
Views: 414
Reputation: 364458
`cmpsd` is an instruction that exists in asm and operates on XMM registers, so it would be inconsistent not to expose it via intrinsics.
(Almost all packed-FP instructions (other than shuffles/blends) have a scalar version, so again there's a consistency argument for ISA design; it's just an extra prefix to the same opcode, and it might even take more transistors to special-case that opcode as not supporting a scalar version.)

Whether or not you or the people designing the intrinsics API could think of a reasonable use-case is not at all the point. It would be foolish to leave things out on that basis; if the intrinsic were missing, anyone who later came up with a use-case would have to fall back to inline asm or write C that compiles to more instructions.
Perhaps someone someday will find a use-case for a vector with a mask as the low half, and a still-valid double in the high half. e.g. maybe ANDing the mask back onto the input with `_mm_and_ps` to conditionally zero just the low element, without needing a packed compare in the high element to produce true. Or consider that all-ones is a bit-pattern for NaN, and all-zero is the bit-pattern for `+0.0`.
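A hedged sketch of that idea (the function name and the choice of `_mm_and_pd` here are mine, not anything standard):

```c
#include <emmintrin.h>

// Use the scalar compare mask to conditionally zero only the low element,
// while the high double passes through untouched and stays a valid value.
static inline __m128d zero_low_unless_greater(__m128d a, __m128d b)
{
    // Low lane: all-ones if a[0] > b[0], else all-zeros.
    // High lane: copied from a, so it is still a normal double.
    __m128d mask = _mm_cmpgt_sd(a, b);

    // AND the mask back onto a: the low element survives only if the compare
    // was true; the high element is ANDed with itself, i.e. unchanged.
    return _mm_and_pd(mask, a);
}
```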
IIRC, `cmppd` slows down if any of the elements are subnormal (if you don't have the DAZ bit set in MXCSR). At least on some older CPUs that existed when the ISA was being designed. So for FP compares, having scalar versions is (or was) essential for avoiding spurious FP assists for elements you don't care about.

Also for avoiding spurious FP exceptions (or setting exception flags if they're masked), like if there's a NaN in the upper element of either vector.
@wim also makes a good point that Intel CPUs before Core 2 decoded 128-bit SIMD instructions to 2 uops, one for each 64-bit half. So using `cmppd` when you don't need the high-half result would always be slower, even if it can't fault. Lots of multi-uop instructions can easily bottleneck the front-end decoders on CPUs without a uop cache, because only one of the decoders can handle them.
You don't normally use intrinsics for FP scalar instructions like `cmpsd` or `addsd`, but they exist in case you want them (e.g. as the last step in a horizontal sum). More often you just leave it to the compiler to use scalar versions of instructions when compiling scalar code without auto-vectorization.
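For example, a sketch of the horizontal-sum case (my own illustrative helper, not from any particular library), where the final add is the scalar `_mm_add_sd` / `addsd`:

```c
#include <emmintrin.h>

// Horizontal sum of both doubles in a __m128d, finishing with a scalar add.
static inline double hsum_pd(__m128d v)
{
    // Bring the high element down into the low lane of a second vector.
    __m128d high = _mm_unpackhi_pd(v, v);
    // Scalar add: only the low lanes are summed; the high lane is ignored.
    return _mm_cvtsd_f64(_mm_add_sd(v, high));
}
```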
And often for scalar compares, compilers will want the result in EFLAGS so will use `ucomisd` instead of creating a compare mask, but for branchless code a mask is often useful, e.g. for `a < b ? c : 0.0` with `cmpsd` and `andpd`. (Or really `andps` because it's shorter and does the same thing as the pointless `andpd`.)
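A minimal sketch of that branchless select (assuming the inputs start out as plain scalars; the compiler may well pick `andps` over `andpd` for the AND):

```c
#include <emmintrin.h>

// Branchless a < b ? c : 0.0 using a scalar compare mask and a bitwise AND.
static inline double select_c_or_zero(double a, double b, double c)
{
    __m128d va = _mm_set_sd(a);
    __m128d vb = _mm_set_sd(b);
    __m128d vc = _mm_set_sd(c);

    // Low lane: all-ones if a < b, else all-zeros (cmpsd with the LT predicate).
    __m128d mask = _mm_cmplt_sd(va, vb);

    // mask & c: keeps c where the compare was true, +0.0 (all-zero bits) otherwise.
    return _mm_cvtsd_f64(_mm_and_pd(mask, vc));
}
```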
Upvotes: 3
Reputation: 3968
The point is probably that on very old hardware, such as the Intel Pentium II and Pentium III, `_mm_cmpgt_sd()` is faster than `_mm_cmpgt_pd()`. See Agner Fog's instruction tables. These processors (PII and PIII) only have a 64-bit wide floating point unit, so 128-bit wide SSE instructions are executed as two 64-bit micro-ops on them. On newer CPUs (such as Intel Core 2 (Merom) and later) the `_pd` and `_ps` versions are as fast as the `_sd` and `_ss` versions. So, you might prefer the `_sd` and `_ss` versions if you only have to compare a single element and don't care about the upper 64 bits of the result.
Moreover, `_mm_cmpgt_pd()` may raise a spurious floating point exception, or suffer from degraded performance, if the upper garbage bits accidentally contain a NaN or a subnormal number; see Peter Cordes' answer. In practice, though, it should be easy to avoid such garbage in the upper bits when programming with intrinsics.
If you want to vectorize your code and need a packed double compare, then use the intrinsic `_mm_cmpgt_pd()` instead of `_mm_cmpgt_sd()`.
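For example, a small sketch of what the packed variant gives you (an illustrative helper of my own, combining the compare with a bitwise AND just to show the per-element mask in use):

```c
#include <emmintrin.h>

// Per-element: result[i] = (a[i] > b[i]) ? a[i] : +0.0
// Both lanes are compared, and the all-ones/all-zeros mask drives a
// branchless select via a bitwise AND.
static inline __m128d keep_where_greater(__m128d a, __m128d b)
{
    __m128d mask = _mm_cmpgt_pd(a, b);  // a[i] > b[i] ? all-ones : all-zeros
    return _mm_and_pd(mask, a);         // keep a[i] where true, else +0.0
}
```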
Upvotes: 5