Entropy
Entropy

Reputation: 295

Inconsistent behaviour in identical code

An error trap trips after running a physics simulation for about 20 minutes. Realising this would be a pain to debug, I duplicated the relevant subroutine in a new project, and called it with hard-coded copies of the original input data at the moment the error occurred. But the error trap did not trip! After two days of tedious work to isolate the exact point where the behaviours of the two instances of the subroutine diverge, I’ve traced the problem to a very simple function for computing a Cross_Product.

This Cross_Product function is identical in both programs. I have even checked the disassembly and made sure that the compiler is producing identical code. The function is also receiving identical input data in both cases. I have even explicitly checked the rounding mode inside the functions and they are identical. Yet, they are returning slightly different results. Specifically, the LSB is different for two out of three of the returned vector components. Even the debugger itself confirms that these two out of three variables are not equal to the expressions they’ve been explicitly assigned. (See screenshot below.)

Debug screenshot

In the original program, the debugger shows “true” in all of the last three lines of the watch list instead of only the last one.

I’m using Code::Blocks 13.12 with the GCC Compiler on XP, with an AMD Athlon 64 CPU. However, I recompiled and ran the test program in Code::Blocks 16.01 on a much more modern Windows 10 machine, with an Intel Core i5 CPU, and the results were identical.

Here is my minimal, complete, and verifiable code to reproduce the bizarre result which disagrees with my original program AND with the debugger itself (unfortunately, I can’t include the original physics program because it’s HUGE):

extern "C" {
    __declspec(dllimport) int __stdcall IsDebuggerPresent(void);
    __declspec(dllimport) void __stdcall DebugBreak(void);
}

struct POLY_Triplet {
   double XYZ[3];
};

POLY_Triplet Cross_Product(POLY_Triplet Vector1, POLY_Triplet Vector2) {
   POLY_Triplet Result;

   Result.XYZ[0] = Vector1.XYZ[1] * Vector2.XYZ[2] - Vector1.XYZ[2] * Vector2.XYZ[1];
   Result.XYZ[1] = Vector1.XYZ[2] * Vector2.XYZ[0] - Vector1.XYZ[0] * Vector2.XYZ[2];
   Result.XYZ[2] = Vector1.XYZ[0] * Vector2.XYZ[1] - Vector1.XYZ[1] * Vector2.XYZ[0];

   return Result;
}

int main() {
   POLY_Triplet Triplet1;

   POLY_Triplet Collision_Axis_Vector;

   POLY_Triplet Boundary_normal;

   *(long long int *)(&Collision_Axis_Vector.XYZ[0]) = 4594681439063077250;
   *(long long int *)(&Collision_Axis_Vector.XYZ[1]) = 4603161398996347097;
   *(long long int *)(&Collision_Axis_Vector.XYZ[2]) = 4605548671330989714;

   *(long long int *)(&Triplet1.XYZ[0]) = -4626277815076045984;
   *(long long int *)(&Triplet1.XYZ[1]) = -4637257536736295424;
   *(long long int *)(&Triplet1.XYZ[2]) = 4589609575355367200;

   if (IsDebuggerPresent()) {
      DebugBreak();
   }

   Boundary_normal = Cross_Product(Collision_Axis_Vector, Triplet1);

   return 0;
}

For convenience, here are the relevant lines for the watch list, as seen in the screenshot:

(Result.XYZ[0] == Vector1.XYZ[1] * Vector2.XYZ[2] - Vector1.XYZ[2] * Vector2.XYZ[1])
(Result.XYZ[1] == Vector1.XYZ[2] * Vector2.XYZ[0] - Vector1.XYZ[0] * Vector2.XYZ[2])
(Result.XYZ[2] == Vector1.XYZ[0] * Vector2.XYZ[1] - Vector1.XYZ[1] * Vector2.XYZ[0])

Can anyone please explain this behaviour?

Upvotes: 2

Views: 633

Answers (3)

Soonts
Soonts

Reputation: 21966

I can confirm the problematic output you’re getting can be caused by a change in x87 precision. The precision value is stored in x87 FPU control register, and when changed, the value persists through the lifetime of your thread, affecting all x87 code running on the thread.

Apparently, some other component of your huge program (or an external library you use) sometimes changes mantissa length from 53 bits (which is the default) to 64 bits (which means use the full precision of these 80 bit x87 registers).

The best way to fix, switch your compiler from x87 to SSE2 target. SSE always use either 32 or 64-bit floats (depending on the instructions used), it doesn’t have 80-bit registers at all. Even your 2003 Athlon 64 already supports that instruction set. As a side effect, your code will become somewhat faster.

Update: If you don’t want to switch to SSE2, you can reset the precision to whatever value you like. Here’s how to do that in Visual C++:

#include <float.h>
uint32_t prev;
_controlfp_s( &prev, _PC_53, _MCW_PC ); // or _PC_64 for 80-bit

For GCC, it’s something like this (untested)

#include <fpu_control.h>
#define _FPU_PRECISION ( _FPU_SINGLE | _FPU_DOUBLE | _FPU_EXTENDED )
fpu_control_t prev, curr;
_FPU_GETCW( prev );
curr = ( prev & ~_FPU_PRECISION ) | _FPU_DOUBLE; // or _FPU_EXTENDED for 80 bit
_FPU_SETCW( curr );

Upvotes: 2

Soonts
Soonts

Reputation: 21966

Compiled your sample with Visual C++. I can confirm the output is slightly different from what you see in the debugger, here’s mine:

CAV: 4594681439063077250, 4603161398996347097, 4605548671330989714
T1: -4626277815076045984, -4637257536736295424, 4589609575355367200
CP: 4589838838395290724, -4627337114727508684, 4592984408164162561

I don’t know for sure what might cause the difference, but here’s an idea.

Since you’ve already looked at the machine code, what are you compiling into, legacy x87 or SSE? I presume it’s SSE, most compilers target this by default, for years already. If you pass -march native to gcc, very likely your CPU has some FMA instruction set (AMD since late 2011, Intel since 2013). Therefore, your GCC compiler used these _mm_fmadd_pd / _mm_fmsub_pd intrinsics, causing your 1-bit difference.

However, that’s all theory. My advice is, instead of trying to find out what caused that difference, you should fix your outer code.

It’s a bad idea to trap to debugger as the result of condition like this.

The numerical difference is very small. That’s the least significant bit in a 52-bit mantissa, i.e. the error is just 2^(-52).

Even if you’ll find out what caused that, disable e.g. FMA or some other thing that caused the issue, doing that is fragile, i.e. you’re going to support that hack through the lifetime of the project. You’ll upgrade your compiler, or the compiler will decide to optimize your code differently, or even you’ll upgrade the CPU — your code may break in the similar way.

Better approach, just stop comparing floating point numbers for exact equality. Instead, calculate e.g. absolute difference, and compare that with a small enough constant.

Upvotes: 2

bolov
bolov

Reputation: 75894

*(long long int *)(&Collision_Axis_Vector.XYZ[0]) = 4594681439063077250;

and all similar lines introduce Undefined Behavior in the program because they violate the Strict Aliasing rule:

you access a double value as long long int

Upvotes: 3

Related Questions