Jimbo

Reputation: 3284

What's the point of _mm_cmpgt_sd and other similar methods?

I was looking for a SIMD option to speed up comparisons and I found the function __m128d _mm_cmpgt_sd (__m128d a, __m128d b)

Apparently it compares the lower doubles and copies the upper double from a into the output. What it does makes sense, but what's the point? What problem is this trying to solve?
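
For concreteness, this little test matches my reading of the docs (a quick sketch; the output comments are just my expectation, not something I've verified on every compiler):

    #include <stdio.h>
    #include <emmintrin.h>   // SSE2

    int main(void) {
        __m128d a = _mm_set_pd(7.5, 2.0);   // high = 7.5, low = 2.0
        __m128d b = _mm_set_pd(9.0, 1.0);   // high = 9.0, low = 1.0

        // Low element: 2.0 > 1.0, so it becomes an all-ones mask
        // (which reads back as NaN when viewed as a double).
        // High element: copied unchanged from a, so 7.5.
        __m128d r = _mm_cmpgt_sd(a, b);

        double lo = _mm_cvtsd_f64(r);                      // low element
        double hi = _mm_cvtsd_f64(_mm_unpackhi_pd(r, r));  // high element
        printf("low = %f, high = %f\n", lo, hi);           // expect: low = nan (or -nan), high = 7.500000
        return 0;
    }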

Upvotes: 3

Views: 414

Answers (2)

Peter Cordes

Reputation: 364458

cmpsd is an instruction that exists in asm and operates on XMM registers, so it would be inconsistent not to expose it via intrinsics.

(Almost all packed-FP instructions (other than shuffles/blends) have a scalar version, so again there's a consistency argument for ISA design: it's just an extra prefix on the same opcode, and it might even take extra transistors to special-case an opcode as not supporting a scalar version.)

Whether or not you or the people designing the intrinsics API could think of a reasonable use-case is not at all the point. It would be foolish to leave things out on that basis; otherwise, when someone does come up with a use-case, they'd have to resort to inline asm or write C that compiles to more instructions.

Perhaps someone someday will find a use-case for a vector with a compare mask as the low half and a still-valid double in the high half: e.g. maybe _mm_and_ps the result back onto the input to conditionally zero just the low element, without needing a packed compare that produces an all-ones result in the high element.

Or consider that all-ones is a bit-pattern for NaN, and all-zero is the bit-pattern for +0.0.
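
A minimal sketch of that and-back-onto-the-input idea (function name is mine, untested): zero the low element of x unless it beats a threshold, and keep the high element untouched. It works because the high half of the cmpgt_sd result is x's own high half, and ANDing a value with itself is a no-op:

    #include <emmintrin.h>   // SSE2

    // Untested sketch: low element becomes x[0] if x[0] > t[0], else +0.0;
    // high element stays x[1], because (x[1] AND x[1]) == x[1].
    static inline __m128d zero_low_unless_greater(__m128d x, __m128d t)
    {
        __m128d mask = _mm_cmpgt_sd(x, t);   // low: all-ones or all-zeros, high: x[1]
        return _mm_and_pd(mask, x);          // low: x[0] or +0.0,          high: x[1]
    }

(With _mm_and_ps you'd need _mm_castpd_ps / _mm_castps_pd around it; the bit pattern ends up the same.)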


IIRC, cmppd slows down if any of the elements are subnormal (if you don't have the DAZ bit set in MXCSR), at least on some older CPUs that existed when the ISA was being designed. So for FP compares, having scalar versions is (or was) essential for avoiding spurious FP assists for elements you don't care about.

Also for avoiding spurious FP exceptions (or setting exception flags if they're masked), like if there's a NaN in the upper element of either vector.
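
(For what it's worth, if you do want DAZ and FTZ set so subnormals can't trigger assists in the first place, the usual intrinsics look something like this sketch:)

    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

    // Treat subnormal inputs as zero (DAZ) and flush subnormal results to zero (FTZ)
    // for the current thread, by setting the corresponding MXCSR bits.
    void enable_daz_ftz(void)
    {
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    }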

@wim also makes a good point that Intel CPUs before Core2 decoded 128-bit SIMD instructions to 2 uops, one for each 64-bit half. So using cmppd when you don't need the high half result would always be slower, even if it can't fault. Lots of multi-uop instructions can easily bottleneck the front-end decoders on CPUs without a uop-cache, because only one of the decoders can handle them.


You don't normally use intrinsics for FP scalar instructions like cmpsd or addsd, but they exist in case you want them (e.g. as the last step in a horizontal sum). More often you just leave it to the compiler to use scalar versions of instructions when compiling scalar code without auto-vectorization.
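
For example, a horizontal sum of a __m128d typically ends with a scalar addsd; something like this sketch:

    #include <emmintrin.h>   // SSE2

    // Sketch: sum both doubles in v, with a scalar add as the last step.
    static inline double hsum_pd(__m128d v)
    {
        __m128d hi  = _mm_unpackhi_pd(v, v);   // broadcast the high element
        __m128d sum = _mm_add_sd(v, hi);       // addsd: low = v[0] + v[1]
        return _mm_cvtsd_f64(sum);
    }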

And for scalar compares, compilers often want the result in EFLAGS, so they'll use ucomisd instead of creating a compare mask. But for branchless code a mask is often useful, e.g. for a < b ? c : 0.0 with cmpsd and andpd. (Or really andps, because it's shorter and does the same thing as the pointless andpd.)
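
That a < b ? c : 0.0 pattern might look something like this with intrinsics (untested sketch, names are mine):

    #include <emmintrin.h>   // SSE2

    // Branchless  a < b ? c : 0.0  on the low elements.
    // The compiler is free to use andps instead of andpd; same bits, shorter encoding.
    static inline double select_or_zero(__m128d a, __m128d b, __m128d c)
    {
        __m128d mask = _mm_cmplt_sd(a, b);            // cmpsd: all-ones if a[0] < b[0], else all-zeros
        return _mm_cvtsd_f64(_mm_and_pd(mask, c));    // c[0] or +0.0
    }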

Upvotes: 3

wim

Reputation: 3968

The point is probably that on very old hardware _mm_cmpgt_sd() is faster than _mm_cmpgt_pd(). Pre-Core 2 Intel processors (e.g. the Pentium III and Pentium M) only have 64-bit wide SIMD execution units, so 128-bit wide SSE instructions are executed as two 64-bit micro-ops on them; see Agner Fog's instruction tables. On newer CPUs (Intel Core 2 (Merom) and later) the _pd and _ps versions are as fast as the _sd and _ss versions. So, you might prefer the _sd and _ss versions if you only have to compare a single element and don't care about the upper 64 bits of the result.

Moreover, _mm_cmpgt_pd() may raise a spurious floating-point exception or suffer degraded performance if the upper garbage bits accidentally contain a NaN or a subnormal number; see Peter Cordes' answer. In practice, though, it should be easy to avoid such upper garbage bits when programming with intrinsics.
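
For example (a sketch, just to illustrate the point): loading scalars with _mm_set_sd zeroes the upper element, so the upper lane of a packed compare only ever sees +0.0:

    #include <emmintrin.h>   // SSE2

    // Sketch: no garbage in the upper lane, so _mm_cmpgt_pd is safe here.
    static inline int scalar_gt(double a, double b)
    {
        __m128d x = _mm_set_sd(a);          // low = a, high = 0.0
        __m128d y = _mm_set_sd(b);          // low = b, high = 0.0
        __m128d m = _mm_cmpgt_pd(x, y);     // upper lane compares 0.0 > 0.0: no NaN, no subnormal
        return _mm_movemask_pd(m) & 1;      // bit 0 holds the low-lane result
    }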

If you want to vectorize your code and need a packed double compare, then use the intrinsic _mm_cmpgt_pd() instead of _mm_cmpgt_sd().

Upvotes: 5
