Reputation: 15020
So far I have the following:
bool IsZero(const double x) {
    return fabs(x) == +0.0;
}
Is this the fastest correct way to compare to exact 0, while accepting both +0.0 and -0.0?
If CPU-specific, let's consider x86-64. If compiler-specific, let's consider the MSVC++ 2017 toolset v141.
Upvotes: 2
Views: 1496
Reputation: 244672
Since you said you want the fastest possible code, I'm going to make some important simplifying assumptions throughout this answer. These are legal, per the question. In particular, I'm assuming x86 and IEEE-754 representations of floating-point values. I'll also mention MSVC-specific quirks, where applicable, although the general discussion would apply to any compiler targeting this architecture.
The way you test whether a floating-point value is equal to zero is by testing all of its bits. If all of the bits are 0, then the value is zero. Strictly speaking, that is the value +0.0. The sign bit can be either 0 or 1, since the representation allows such things as positive and negative 0.0, as you mention in the question. But the two compare equal, and the distinction is irrelevant here, so what you really need is to test all bits except the sign bit.
This can be done quickly and efficiently with some bit-twiddling. The sign bit is the most significant bit of the representation, so you simply shift it out and then test whether the remaining bits are all zero.
This trick is described by Agner Fog in his Optimizing Subroutines in Assembly Language. Specifically, example 17.4b (on page 156 in the current version).
For a single-precision floating-point value (i.e., float), which is 32 bits wide:
mov eax, DWORD PTR [floatingPointValue]
add eax, eax ; shift out the sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
Translating this into C code, you'd do something like:
const uint32_t bits = *(reinterpret_cast<uint32_t*>(&value));  // reinterpret the float's bits as an integer
return ((bits + bits) == 0);                                   // adding shifts out the sign bit; zero only for +/-0.0f
Of course, this is formally unsafe because of the type punning. MSVC lets you get away with it, no problem. In fact, if you try to actually conform to the standard and play it safe, MSVC will tend to generate less efficient code, decreasing the effectiveness of this trick. If you want to do this safely, you'll need to verify the output of your compiler and make sure it's doing what you want. Some assertions are also recommended.
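If you'd rather stay within defined behavior, the usual alternative is to copy the bits out with memcpy instead of casting pointers. Here is a sketch of that; you would still want to check the disassembly to confirm that MSVC collapses the memcpy and produces the same two-instruction sequence:
#include <cstdint>
#include <cstring>

bool IsZero(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));  // well-defined way to read the float's bit pattern
    return ((bits + bits) == 0);               // adding shifts out the sign bit; zero only for +/-0.0f
}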
If you're okay with the unsafe nature of this approach, you will find that it is faster than a poorly-predicted conditional branch, so when you're dealing with random input values, it might be a performance win. For comparison purposes, here is what you'll see from MSVC if you just do a naive test for equality against 0.0:
;; assuming /arch:IA32, which is *not* the default in modern versions of MSVC
;; but necessary if you cannot assume SSE2 support
fld DWORD PTR [floatingPointValue]
fldz
fucompp
fnstsw ax
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
;; assuming /arch:SSE2, which *is* the default in modern versions of MSVC
movss xmm0, DWORD PTR [floatingPointValue]
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
Ugly, and potentially slow. There are branchless ways of doing this, but MSVC won't use them.
An obvious drawback to the "optimized" implementation described above is that it requires the floating-point value to be loaded from memory in order to access its bits. There are no x87 instructions that can access the bits directly, and there's no way to go directly from an x87 register to a GP register without going through memory. Since memory access is slow, this does incur a performance penalty, but in my tests, it's still faster than a mispredicted branch.
If you're using any of the standard calling conventions on 32-bit x86 (__cdecl, __stdcall, etc.), then all floating-point values are passed and returned in the x87 registers, so there's no difference in moving from an x87 register to a GP register versus moving from an x87 register to an SSE register.
The story is a bit different if you're targeting x86-64 or if you are using __vectorcall on x86-32. Then, you actually have floating-point values stored and passed in SSE registers, so you can take advantage of branchless SSE instructions. At least, theoretically. MSVC won't, unless you hold its hand. It would normally do the same branching comparison shown above, just without the extra memory load:
;; MSVC output for a __vectorcall function, targeting x86-32 with /arch:SSE2
;; and/or for x86-64 (which always uses a vector calling convention and SSE2)
;; The floating point value being compared is passed directly in XMM0
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
I've demonstrated the compiler output for a very simple bool IsZero(float val) function, but in my observations, MSVC always emits a UCOMISS+JP sequence for this type of comparison, no matter how the comparison is incorporated into the input code. Again, fine if the zero-ness of the input is predictable, but relatively lousy if branch prediction fails.
If you want to ensure you get branchless code, avoiding the possibility of branch-misprediction stalls, then you need to use intrinsics to do the comparison. These intrinsics will force MSVC to emit code closer to what you would expect:
return (_mm_ucomieq_ss(_mm_set_ss(floatingPointValue), _mm_setzero_ps()) != 0);
Unfortunately, the output is still not perfect. You suffer from general optimization deficiencies surrounding the use of intrinsics—namely, some redundant shuffling of the input value between various SSE registers—but that is (A) unavoidable, and (B) not a measurable performance problem.
I'll note here that other compilers, like Clang and GCC, don't need their hands held. You can just write value == 0.0. The exact sequence of code that they emit varies depending on your optimization settings, but you'll see either COMISS+SETE, UCOMISS+SETNP+CMOVNE, or CMPEQSS+MOVD+NEG (the latter is used exclusively by ICC). Attempting to hold their hands with intrinsics would almost certainly result in less efficient output, so this probably needs to be #ifdef'ed to limit it to MSVC.
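A sketch of what that #ifdef might look like (the exact preprocessor condition is my assumption; adjust it for the compilers you actually support):
#if defined(_MSC_VER) && !defined(__clang__)
#include <xmmintrin.h>  // _mm_set_ss, _mm_setzero_ps, _mm_ucomieq_ss
#endif

bool IsZero(float value) {
#if defined(_MSC_VER) && !defined(__clang__)
    // MSVC: use the intrinsic to force a branchless UCOMISS
    return _mm_ucomieq_ss(_mm_set_ss(value), _mm_setzero_ps()) != 0;
#else
    // Clang, GCC, ICC: a plain comparison already compiles to branchless code
    return value == 0.0f;
#endif
}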
That's single-precision values, which have a width of 32 bits. What about double-precision values, which are twice as long? You'd think these would have 63 bits to test (since the sign bit is still ignored), but there's a twist. If you can rule out the possibility of denormal numbers, then you can get away with testing only the upper bits (again, assuming little-endian).
Agner Fog discusses this as well (example 17.4d). If you exclude the possibility of denormal numbers, then zero is the only value whose exponent bits are all 0. The exponent field lives in the upper 32 bits, alongside the sign bit and the top of the mantissa, so you can just test those upper bits exactly as you did for single-precision values:
mov eax, DWORD PTR [floatingPointValue+4] ; load upper bits only
add eax, eax ; shift out sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
In the same unsafe C++:
const uint64_t bits = *(reinterpret_cast<uint64_t*>(&value));
const uint32_t upperBits = (bits & 0xFFFFFFFF00000000) >> 32;  // keep only the upper 32 bits
return ((upperBits + upperBits) == 0);                         // shift out the sign bit and test
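As with the single-precision case, a memcpy-based variant avoids the type punning. A sketch, under the same little-endian and no-denormals assumptions; verify that your compiler still reduces it to the single 32-bit load shown above:
#include <cstdint>
#include <cstring>

bool IsZero(double value) {
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));                      // well-defined copy of the double's bits
    const uint32_t upperBits = static_cast<uint32_t>(bits >> 32);  // upper 32 bits only
    return ((upperBits + upperBits) == 0);                         // shift out the sign bit and test
}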
If you do need to account for denormal values, then you aren't saving yourself anything. I haven't tested this, but you're probably no worse off letting the compiler generate the code for a naive comparison. At least, not for x86-32. You might still gain on x86-64, where you have 64-bit-wide GP registers.
If you can assume SSE2 support (which would be all x86-64 systems, and all modern x86-32 builds as well), then you just use intrinsics, and you get denormal support for free (well, not really free; there are internal penalties in the CPU, I believe, but we'll ignore those):
return (_mm_ucomieq_sd(_mm_set_sd(floatingPointValue), _mm_setzero_pd()) != 0);
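Wrapped into a self-contained function, that might look like the following (my own sketch; note that the double-precision intrinsics come from the SSE2 header):
#include <emmintrin.h>  // SSE2 intrinsics: _mm_set_sd, _mm_setzero_pd, _mm_ucomieq_sd

bool IsZero(double floatingPointValue) {
    // Branchless compare against 0.0; nonzero for both +0.0 and -0.0
    return _mm_ucomieq_sd(_mm_set_sd(floatingPointValue), _mm_setzero_pd()) != 0;
}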
Again, as with single-precision values, the use of intrinsics is not necessary on compilers other than MSVC to get optimal code, and indeed may result in sub-optimal code, so should be avoided.
Upvotes: 5
Reputation: 388
If you look at the assembly the compiler generates, you can see which instructions are used for the different versions of your code. Having the assembly, you can estimate which one is better.
With GCC you can keep the intermediate files (including the assembly) like this:
gcc -save-temps main.cpp
Upvotes: 0
Reputation: 97
In plain and simple words, if you want to accept exactly +0.0 and -0.0, you just have to use:
x == 0.0
OR
From the <cmath> header you can use std::fpclassify(double arg), which returns FP_ZERO for both -0.0 and +0.0.
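For example (a minimal sketch; the function names are just for illustration):
#include <cmath>

bool IsZero(double x) {
    return x == 0.0;  // true for both +0.0 and -0.0
}

bool IsZeroViaClassify(double x) {
    return std::fpclassify(x) == FP_ZERO;  // FP_ZERO covers both +0.0 and -0.0
}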
Upvotes: 2