Jim Buck

Reputation: 20726

Windows/Intel and iOS/Arm differences in floating point calculations

In an update to Autodesk TinkerBox, I've come across an unexpected floating point calculation difference between our internal-only development version that runs on Windows and the version that runs on our final target of iOS (the following info is based on a debug build running on iPad1).

We use Chipmunk for our physics needs. This is unlikely to be the only calculation with this problem, but it's the particular one I was analyzing:

static inline cpFloat
cpvcross(const cpVect &v1, const cpVect &v2)
{
    return v1.x*v2.y - v1.y*v2.x;
}

The particular case I'm looking at has v1 as (0xC0A7BC40 [-5.241729736328125], 0xC0E84C80 [-7.25933837890625]) and v2 as (0x428848FB [68.14253997802734], 0x42BCBE40 [94.37158203125]). I focus on the hex versions of the values since those are the exact values that are the inputs on both platforms, verified by inspecting the memory locations of v1 and v2 on both platforms. For reference, the floating point values in brackets were grabbed from putting the hex values into this site.
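As a sanity check, here is a minimal standalone sketch (not part of TinkerBox or Chipmunk, and the helper name bits_to_float is just something I made up) that reinterprets those exact bit patterns as floats via memcpy and prints them, so anyone can reproduce the inputs without relying on an external converter:

    #include <cstdio>
    #include <cstring>
    #include <cstdint>

    // Reinterpret an IEEE-754 single-precision bit pattern as a float.
    // (memcpy avoids the strict-aliasing issues of pointer casts.)
    static float bits_to_float(uint32_t bits)
    {
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main()
    {
        // The exact input bit patterns observed on both platforms.
        std::printf("v1 = (%.17g, %.17g)\n",
                    bits_to_float(0xC0A7BC40), bits_to_float(0xC0E84C80));
        std::printf("v2 = (%.17g, %.17g)\n",
                    bits_to_float(0x428848FB), bits_to_float(0x42BCBE40));
        return 0;
    }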

On Windows the result is 0xBA15F8E8 [-0.0005720988847315311], and on iOS the result is 0xBA100000 [-0.00054931640625]. Of course the difference is small in absolute terms, but not percentage-wise, and it accumulates over time to show deviations in the behavior of the physics. (Please do not suggest using doubles. It slows the game down, of course, and not using doubles is not the issue here. :) )
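To put a number on "percentage-wise": (0.0005720989 - 0.0005493164) / 0.0005720989 ≈ 0.0398, so the two platforms disagree by roughly 4% on this single cross product, which is more than enough, accumulated over many frames, to produce visibly different physics.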

For reference, this is a debug build on both platforms, and the code compiles as:

Windows

static inline cpFloat
cpvcross(const cpVect &v1, const cpVect &v2)
{
01324790  push        ebp  
01324791  mov         ebp,esp 
01324793  sub         esp,0C4h 
01324799  push        ebx  
0132479A  push        esi  
0132479B  push        edi  
0132479C  lea         edi,[ebp-0C4h] 
013247A2  mov         ecx,31h 
013247A7  mov         eax,0CCCCCCCCh 
013247AC  rep stos    dword ptr es:[edi] 
    return v1.x*v2.y - v1.y*v2.x;
013247AE  mov         eax,dword ptr [v1] 
013247B1  fld         dword ptr [eax] 
013247B3  mov         ecx,dword ptr [v2] 
013247B6  fmul        dword ptr [ecx+4] 
013247B9  mov         edx,dword ptr [v1] 
013247BC  fld         dword ptr [edx+4] 
013247BF  mov         eax,dword ptr [v2] 
013247C2  fmul        dword ptr [eax] 
013247C4  fsubp       st(1),st 
013247C6  fstp        dword ptr [ebp-0C4h] 
013247CC  fld         dword ptr [ebp-0C4h] 
}
013247D2  pop         edi  
013247D3  pop         esi  
013247D4  pop         ebx  
013247D5  mov         esp,ebp 
013247D7  pop         ebp  
013247D8  ret              

iOS

invent`cpvcross at cpVect.h:63:
0x94a8:  sub    sp, sp, #8
0x94ac:  str    r0, [sp, #4]
0x94b0:  str    r1, [sp]
0x94b4:  ldr    r0, [sp, #4]
0x94b8:  vldr   s0, [r1]
0x94bc:  vldr   s1, [r1, #4]
0x94c0:  vldr   s2, [r0]
0x94c4:  vldr   s3, [r0, #4]
0x94c8:  vmul.f32 s1, s2, s1
0x94cc:  vmul.f32 s0, s3, s0
0x94d0:  vsub.f32 s0, s1, s0
0x94d4:  vmov   r0, s0
0x94d8:  add    sp, sp, #8
0x94dc:  bx     lr   

As near as I can tell, those calculations are identical, assuming each instruction computes the result of its operands identically. Xcode does not let me step through this instruction-by-instruction, for some reason (Visual Studio does), so I can't narrow down which instruction(s) produce a different result than the Intel FP unit.

So, why is the result of such a simple calculation so different between the two CPUs?

Upvotes: 4

Views: 3181

Answers (1)

Igor Skochinsky

Reputation: 25278

You're seeing the results of using different floating-point precision for calculations.

In the x86 code, the calculations are done in the FPU registers which are extended precision (80 bits), while NEON code uses floats (32-bit). Apparently the extra precision during multiplication and subtraction allows x86 code to retain more bits while ARM code loses them.
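You can demonstrate the effect without touching the control word by choosing where the intermediate rounding happens. This is only an illustrative sketch (not Chipmunk code, helper names are invented): the product of two floats is exactly representable in a double, so computing the intermediates in double mimics the x87's extra precision, while forcing each product back to 32 bits mimics what vmul.f32/vsub.f32 do on ARM:

    #include <cstdio>
    #include <cstring>
    #include <cstdint>

    static float bits_to_float(uint32_t b) { float f; std::memcpy(&f, &b, sizeof f); return f; }
    static uint32_t float_to_bits(float f) { uint32_t b; std::memcpy(&b, &f, sizeof b); return b; }

    int main()
    {
        // The exact inputs from the question.
        const float v1x = bits_to_float(0xC0A7BC40), v1y = bits_to_float(0xC0E84C80);
        const float v2x = bits_to_float(0x428848FB), v2y = bits_to_float(0x42BCBE40);

        // Wide intermediates: each float*float product is exact in double,
        // so only the final conversion rounds to float (the x87-like path).
        float wide = (float)((double)v1x * (double)v2y - (double)v1y * (double)v2x);

        // Narrow intermediates: round each product to 32 bits before
        // subtracting, which is what the ARM code does.
        volatile float p1 = v1x * v2y;
        volatile float p2 = v1y * v2x;
        float narrow = p1 - p2;

        std::printf("wide  : %.10g (0x%08X)\n", wide,   float_to_bits(wide));
        std::printf("narrow: %.10g (0x%08X)\n", narrow, float_to_bits(narrow));
        return 0;
    }

If the compiler really performs those operations at the indicated widths, wide should come out as 0xBA15F8E8 and narrow as 0xBA100000, i.e. exactly the two results quoted in the question.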

Using the _controlfp family of functions, you can tell the FPU to use a specific precision for all calculations. I made a small program based on the example from MSDN and was able to get the same result as the ARM code:

#include <stdio.h>
typedef float cpFloat;
struct cpVect  {cpFloat x, y;};
struct cpVectI {unsigned int x, y;};
union cpv {cpVectI i; cpVect f;};
union cfi { float f; unsigned int i;};

cpFloat cpvcross(const cpVect &v1, const cpVect &v2)
{
    return v1.x*v2.y - v1.y*v2.x;
}

#include <float.h>
#pragma fenv_access (on)

int main(void)
{
  cpv v1, v2;
  cfi fi;
  v1.i.x = 0xC0A7BC40;
  v1.i.y = 0xC0E84C80;
  v2.i.x = 0x428848FB;
  v2.i.y = 0x42BCBE40;

  unsigned int control_word_x87;

  // Show original x87 control word and do calculation.
  __control87_2(0, 0, &control_word_x87, 0);
  printf( "Original: 0x%.4x\n", control_word_x87 );
  fi.f = cpvcross(v1.f, v2.f);
  printf("Result: %g (0x%08X)\n", fi.f, fi.i);

  // Set precision to 24 bits and recalculate.
  __control87_2(_PC_24, MCW_PC, &control_word_x87, 0);
  printf( "24-bit:   0x%.4x\n", control_word_x87);
  fi.f = cpvcross(v1.f, v2.f);
  printf("Result: %g (0x%08X)\n", fi.f, fi.i);

  // Restore default precision-control bits and recalculate.
  __control87_2( _CW_DEFAULT, MCW_PC, &control_word_x87, 0);
  printf( "Default:  0x%.4x\n", control_word_x87 );
  fi.f = cpvcross(v1.f, v2.f);
  printf("Result: %g (0x%08X)\n", fi.f, fi.i);
}

Here's the output:

Original: 0x9001f
Result: -0.000572099 (0xBA15F8E8)
24-bit:   0xa001f
Result: -0.000549316 (0xBA100000)
Default:  0x9001f
Result: -0.000572099 (0xBA15F8E8)

Be careful when using this function and calling external libraries; some code might be relying on the default settings and will break if you change them behind its back.
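If you do go this route, one way to limit the blast radius is to scope the precision change. Here's a hypothetical RAII helper (MSVC-specific, using _controlfp_s; the class name and the usage inside the comment are mine, not something Chipmunk provides) that switches the x87 to 24-bit precision for a physics step and restores the caller's setting afterwards:

    #include <float.h>

    // Hypothetical helper: force 24-bit x87 precision inside a scope,
    // then restore whatever precision the caller had. MSVC-only.
    class ScopedSinglePrecision
    {
    public:
        ScopedSinglePrecision()
        {
            _controlfp_s(&saved_, 0, 0);             // mask 0 = just read the control word
            unsigned int unused;
            _controlfp_s(&unused, _PC_24, _MCW_PC);  // set precision control to 24 bits
        }
        ~ScopedSinglePrecision()
        {
            unsigned int unused;
            _controlfp_s(&unused, saved_, _MCW_PC);  // restore previous precision bits
        }
    private:
        unsigned int saved_;
    };

    // Usage (hypothetical):
    // {
    //     ScopedSinglePrecision sp;
    //     cpSpaceStep(space, dt);   // physics runs with float-width intermediates
    // }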

Another option could be to switch to SSE intrinsics, which compute at a fixed 32-bit or 64-bit precision rather than the x87's extended precision. Unfortunately, /arch:SSE2 doesn't seem to make use of SSE2 for floating point (at least in VS2010).
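For what it's worth, a hand-written scalar-SSE version of cpvcross would sidestep the x87 entirely. This is only a sketch of the idea (cpvcross_sse is not existing Chipmunk code); _mm_mul_ss and _mm_sub_ss round every intermediate to 32 bits, which matches the ARM behaviour:

    #include <xmmintrin.h>

    // Sketch only: scalar single-precision cross product via SSE, using the
    // cpVect struct from above. Every intermediate is rounded to 32 bits,
    // exactly as vmul.f32/vsub.f32 do on ARM.
    static inline float cpvcross_sse(const cpVect &v1, const cpVect &v2)
    {
        __m128 a = _mm_mul_ss(_mm_set_ss(v1.x), _mm_set_ss(v2.y));
        __m128 b = _mm_mul_ss(_mm_set_ss(v1.y), _mm_set_ss(v2.x));
        return _mm_cvtss_f32(_mm_sub_ss(a, b));
    }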

Upvotes: 4
