Reputation: 20726
In an update to Autodesk TinkerBox, I've come across an unexpected floating-point calculation difference between our internal-only development version that runs on Windows and the version that runs on our final target, iOS (the following info is based on a debug build running on an iPad 1).
We use Chipmunk for our physics needs. This is almost certainly not the only calculation affected, but it's the particular one I was analyzing:
static inline cpFloat
cpvcross(const cpVect &v1, const cpVect &v2)
{
return v1.x*v2.y - v1.y*v2.x;
}
The particular case I'm looking at has v1 as (0xC0A7BC40 [-5.241729736328125], 0xC0E84C80 [-7.25933837890625]) and v2 as (0x428848FB [68.14253997802734], 0x42BCBE40 [94.37158203125]). I focus on the hex versions of the values since those are the exact values that are the inputs on both platforms, verified by inspecting the memory locations of v1 and v2 on both platforms. For reference, the floating-point values in brackets come from feeding the hex values through an online IEEE-754 converter.
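(For anyone wanting to reproduce this kind of comparison, a small helper along these lines, not part of the actual project code and only a sketch, is enough to dump a float's exact bit pattern on either platform:)
#include <cstdio>
#include <cstring>
#include <cstdint>

static void dump_float(const char *name, float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // view the IEEE-754 bit pattern without aliasing issues
    std::printf("%s = 0x%08X (%.9g)\n", name, bits, (double)f);
}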
On Windows the result is 0xBA15F8E8 [-0.0005720988847315311], and on iOS the result is 0xBA100000 [-0.00054931640625]. The absolute difference is small, but percentage-wise it isn't, and it accumulates over time into visible deviations in the behavior of the physics. (Please do not suggest using doubles. It slows the game down, of course, and not using doubles is not the issue here. :) )
For reference, this is a debug build on both platforms, and the code compiles as:
Windows
static inline cpFloat
cpvcross(const cpVect &v1, const cpVect &v2)
{
01324790 push ebp
01324791 mov ebp,esp
01324793 sub esp,0C4h
01324799 push ebx
0132479A push esi
0132479B push edi
0132479C lea edi,[ebp-0C4h]
013247A2 mov ecx,31h
013247A7 mov eax,0CCCCCCCCh
013247AC rep stos dword ptr es:[edi]
return v1.x*v2.y - v1.y*v2.x;
013247AE mov eax,dword ptr [v1]
013247B1 fld dword ptr [eax]
013247B3 mov ecx,dword ptr [v2]
013247B6 fmul dword ptr [ecx+4]
013247B9 mov edx,dword ptr [v1]
013247BC fld dword ptr [edx+4]
013247BF mov eax,dword ptr [v2]
013247C2 fmul dword ptr [eax]
013247C4 fsubp st(1),st
013247C6 fstp dword ptr [ebp-0C4h]
013247CC fld dword ptr [ebp-0C4h]
}
013247D2 pop edi
013247D3 pop esi
013247D4 pop ebx
013247D5 mov esp,ebp
013247D7 pop ebp
013247D8 ret
iOS
invent`cpvcross at cpVect.h:63:
0x94a8: sub sp, sp, #8
0x94ac: str r0, [sp, #4]
0x94b0: str r1, [sp]
0x94b4: ldr r0, [sp, #4]
0x94b8: vldr s0, [r1]
0x94bc: vldr s1, [r1, #4]
0x94c0: vldr s2, [r0]
0x94c4: vldr s3, [r0, #4]
0x94c8: vmul.f32 s1, s2, s1
0x94cc: vmul.f32 s0, s3, s0
0x94d0: vsub.f32 s0, s1, s0
0x94d4: vmov r0, s0
0x94d8: add sp, sp, #8
0x94dc: bx lr
As near as I can tell, those calculations are identical, assuming each instruction computes the same result for the same operands. Xcode does not allow me to step through the code instruction by instruction for some reason (which Visual Studio does allow), so I can't narrow down which instruction(s) deviate compared to the Intel FP unit.
So, why is the result of such a simple calculation so different between the two CPUs?
Upvotes: 4
Views: 3181
Reputation: 25278
You're seeing the results of using different floating-point precision for calculations.
In the x86 code, the calculations are done in the x87 FPU registers, which hold extended-precision (80-bit) values, while the NEON code works on single-precision (32-bit) floats throughout. Your two products are nearly equal, so the subtraction cancels most of the leading bits, and the answer is dominated by the low-order bits that only the 80-bit intermediates keep; the ARM code has already rounded those bits away.
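One way to see this on x86 (a sketch of my own, not something from Chipmunk) is to force each intermediate product out of the 80-bit registers by storing it into a 32-bit float; the subtraction then cancels 24-bit mantissas, just like NEON, and should give a result matching, or at least much closer to, the iOS output:
static inline cpFloat cpvcross_single(const cpVect &v1, const cpVect &v2)
{
    volatile float a = v1.x * v2.y;  // product rounded to single precision on store
    volatile float b = v1.y * v2.x;  // product rounded to single precision on store
    return a - b;                    // cancellation now happens at 24-bit precision
}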
Using the _controlfp family of functions you can tell the FPU to use a specific precision for all calculations. I made a small program based on the example from MSDN and was able to get the same result as the ARM code:
#include <stdio.h>
typedef float cpFloat;
struct cpVect {cpFloat x, y;};
struct cpVectI {unsigned int x, y;};
union cpv {cpVectI i; cpVect f;};
union cfi { float f; unsigned int i;};
cpFloat cpvcross(const cpVect &v1, const cpVect &v2)
{
return v1.x*v2.y - v1.y*v2.x;
}
#include <float.h>
#pragma fenv_access (on)
int main(void)
{
cpv v1, v2;
cfi fi;
v1.i.x = 0xC0A7BC40;
v1.i.y = 0xC0E84C80;
v2.i.x = 0x428848FB;
v2.i.y = 0x42BCBE40;
unsigned int control_word_x87;
// Show original x87 control word and do calculation.
__control87_2(0, 0, &control_word_x87, 0);
printf( "Original: 0x%.4x\n", control_word_x87 );
fi.f = cpvcross(v1.f, v2.f);
printf("Result: %g (0x%08X)\n", fi.f, fi.i);
// Set precision to 24 bits and recalculate.
__control87_2(_PC_24, MCW_PC, &control_word_x87, 0);
printf( "24-bit: 0x%.4x\n", control_word_x87);
fi.f = cpvcross(v1.f, v2.f);
printf("Result: %g (0x%08X)\n", fi.f, fi.i);
// Restore default precision-control bits and recalculate.
__control87_2( _CW_DEFAULT, MCW_PC, &control_word_x87, 0);
printf( "Default: 0x%.4x\n", control_word_x87 );
fi.f = cpvcross(v1.f, v2.f);
printf("Result: %g (0x%08X)\n", fi.f, fi.i);
}
Here's the output:
Original: 0x9001f
Result: -0.000572099 (0xBA15F8E8)
24-bit: 0xa001f
Result: -0.000549316 (0xBA100000)
Default: 0x9001f
Result: -0.000572099 (0xBA15F8E8)
Be careful when using this function and calling external libraries; some code might be relying on the default settings and will break if you change them behind its back.
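One way to limit the blast radius (a sketch, assuming the MSVC-specific _controlfp_s from <float.h>) is to scope the precision change and restore the original control word when you're done:
#include <float.h>

struct ScopedPrecision24
{
    unsigned int saved;
    ScopedPrecision24()
    {
        _controlfp_s(&saved, 0, 0);              // read the current control word without changing it
        unsigned int unused;
        _controlfp_s(&unused, _PC_24, _MCW_PC);  // switch x87 precision to 24 bits
    }
    ~ScopedPrecision24()
    {
        unsigned int unused;
        _controlfp_s(&unused, saved, _MCW_PC);   // restore the original precision bits on scope exit
    }
};
You could wrap just the physics step in a ScopedPrecision24 so any other code keeps seeing the default settings.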
Another option could be to switch to SSE intrinsics, which will use a specific precision. Unfortunately, /arch:SSE2 doesn't seem to make use of SSE2 for floating point (at least in VS2010).
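If you do go the intrinsics route, a minimal sketch (not tested against Chipmunk, reusing the cpVect struct from the example above) could look like this; each scalar operation rounds to single precision regardless of the x87 control word:
#include <xmmintrin.h>

static inline float cpvcross_sse(const cpVect &v1, const cpVect &v2)
{
    __m128 a = _mm_mul_ss(_mm_set_ss(v1.x), _mm_set_ss(v2.y)); // single-precision multiply
    __m128 b = _mm_mul_ss(_mm_set_ss(v1.y), _mm_set_ss(v2.x)); // single-precision multiply
    return _mm_cvtss_f32(_mm_sub_ss(a, b));                    // single-precision subtract
}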
Upvotes: 4