Reputation: 923
When using numerical methods (e.g. Runge-Kutta), the finite precision of floats on the computer can influence the solution (Brouwer's Law).
This paper suggests, as a remedy, simulating exact Runge-Kutta coefficients by splitting them as e.g. A = B + C, where B is an exact machine number and C is some small correction.
Can someone explain how this works in practice? E.g. if A = 3/10, then how would one determine B and C?
Thanks for any help.
Upvotes: 1
Views: 97
Reputation: 7874
That trick may have worked in 2007 when the paper was submitted, but I think it would be unlikely to work on a modern platform.
On a modern x86 (both 32 and 64 bit) processor there are two separate instruction sets for floating point computations:
the older x87 instructions (dating back to the original 8087 coprocessor), which had 80-bit registers
more recent SSE instructions, which used registers of the same width as the format (32 bits for float, 64 bits for double).
The newer SSE instructions are generally preferred by modern compilers, as they tend to be faster: they can be fully pipelined and they support fancy things like SIMD operations. However, in 2007 some compilers still used only x87 instructions by default, so that the binaries could be run on older machines (this was especially true for 32-bit targets).
The 80-bit registers supported a significand of up to 64 bits, which is 11 bits more than the 53-bit significand of a 64-bit double. The idea is that you could use those extra bits to reduce intermediate rounding error, which is exactly what this splitting trick exploits.
Consider a simpler version of their problem: computing

Y = A*X

by splitting A into B + C as they suggest, where B has only 10 significant bits. The operation B*X then does not incur any rounding error, since the product has at most 10 + 53 = 63 significant bits, which fits in the 64-bit x87 significand. The full computation

Y = B*X + C*X

will thus give you the result to almost the full 64 bits of accuracy.
Without the extended precision, B*X will generally incur a rounding error of roughly the same size as if you had just computed A*X directly (unless X itself has been stored with reduced precision).
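
To make the contrast concrete, here is a minimal C sketch (my own illustration, not from the original answer), using the A = 3/10 coefficient from the question; it assumes long double maps to the 80-bit x87 format, which holds for gcc and clang on x86 but not on every platform:

#include <stdio.h>

int main(void) {
    /* A = 3/10 split as suggested: B is an exact machine number with few
       significant bits, C is the small correction */
    double B = 307.0 / 1024.0;   /* 0.2998046875, exactly representable */
    double C = 1.953125e-4;      /* nearest double to 3/10 - 307/1024 = 1/5120 */
    double X = 1.0 / 3.0;        /* some operand */

    /* plain double arithmetic: B*X is rounded to 53 bits before the addition */
    double y_plain = B * X + C * X;

    /* emulated extended precision: with a 64-bit significand the product B*X
       (at most 63 significant bits) is held exactly */
    long double y_ext = (long double)B * X + (long double)C * X;

    printf("naive A*X      : %.21g\n", 0.3 * X);
    printf("split, double  : %.21g\n", y_plain);
    printf("split, extended: %.21Lg\n", y_ext);
    return 0;
}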
Now this sounds great, so you might wonder why the SSE instructions got rid of it. Unfortunately, the extra precision wasn't predictable: in some cases the compiler would keep the intermediates in the 80-bit registers, but in other cases it would have to "spill" them to memory, losing the extra precision. This in turn gave occasionally bizarre results, such as an expression like x+y == x+y evaluating to false, depending on when the individual operations were evaluated.
However, all is not lost! If you have a fairly recent machine, you may be able to take advantage of fused multiply-add (fma) operations to get increased accuracy. In this case, it would look something like
Y = fma(B,X,C*X)
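
A minimal sketch of that suggestion (again my own code, not from the answer), using the standard C fma from <math.h> (C99; link with -lm on some systems) and the B and C values from the 3/10 example:

#include <math.h>
#include <stdio.h>

int main(void) {
    double B = 307.0 / 1024.0;   /* exact machine number with few significant bits */
    double C = 1.953125e-4;      /* nearest double to the correction 1/5120 */
    double X = 1.0 / 3.0;

    /* fma rounds only once: the product B*X is kept exact inside the fma,
       and the (relatively tiny) C*X is added before that single rounding */
    double Y = fma(B, X, C * X);

    printf("split + fma: %.21g\n", Y);
    printf("naive A*X  : %.21g\n", 0.3 * X);
    return 0;
}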
Upvotes: 1
Reputation: 1141
In the paper they suggest using a rational approximation to A with denominator 1024 (which means the approximation B has at most 10 significant bits). You have (3/10)*1024 = 307.2, so B will be
B = 307/1024 = 0.2998046875, and C = A - B = 1/5120 = 0.0001953125
C is not exactly representable in IEEE binary64; the nearest floating-point number is
C = 1.9531250000000001084202172485504434...E-4.
(Computing A - B naively in double precision would instead give 1.9531249999998889776975374843...E-4, because the stored value of 0.3 is itself already rounded.)
Insert these values into the formulas (3.1f).
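
A small C sketch of this recipe (my own illustration, not from the paper), again for A = 3/10; the correction is written as a decimal literal so that it is rounded from the exact remainder rather than from the already-rounded 0.3:

#include <math.h>
#include <stdio.h>

int main(void) {
    double A = 0.3;   /* fl(3/10), itself already slightly rounded */

    /* numerator of the denominator-1024 approximation: (3/10)*1024 = 307.2 -> 307 */
    double B = round(A * 1024.0) / 1024.0;   /* 307/1024 = 0.2998046875, exact */

    /* C = 3/10 - 307/1024 = 1/5120: written as a decimal literal so the compiler
       rounds the exact remainder to the nearest binary64; computing A - B here
       would only reproduce the rounding error already present in fl(0.3) */
    double C = 1.953125e-4;

    printf("B = %.21g\n", B);
    printf("C = %.21g\n", C);
    return 0;
}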
Upvotes: 2