zino

Reputation: 1472

Given two floats, how can I detect when precision will be lost in calculations?

Considering 64-bit/8-byte IEEE double floats that represent exact numbers (not approximations).

I'm using JS and have access to the mantissa and exponent.

I'm interested in the correct general use of floating-point arithmetic (addition, subtraction, multiplication, division).

Is the following correct about floats:

- You can only represent exact decimal numbers in binary floats where the denominator is a power of 2.
- It only makes sense to do calculations when both numbers are in the same "window of precision" or scale. E.g. very small numbers calculated with other very small numbers? Never very large numbers and very small numbers?

So how do I programmatically detect that two numbers are in the same window of precision/scale?

Would it be possible to find the exact number halfway between a very large number and a very small number, or is that calculation stuck at (large/2) because ((large+small)/2) loses the smaller number entirely?
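A quick JS check of that absorption effect (the particular values below are arbitrary):

```js
// 1e16 is just above 2^53, where the spacing between adjacent doubles
// is 2, so adding 1 is a no-op and the "midpoint" collapses to large/2.
const large = 1e16;
const small = 1;
console.log(large + small === large);            // true: small is absorbed
console.log((large + small) / 2 === large / 2);  // true: midpoint is lost
```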

Upvotes: 1

Views: 347

Answers (1)

Sneftel

Reputation: 41552

You can only represent exact decimal numbers in binary floats where the denominator is a power of 2.

Yep! This follows trivially from the definition of a floating-point number. Of course, not all numbers with power-of-2 denominators are exactly representable as binary64 floats.
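A quick JS illustration of both points:

```js
console.log(3 / 4);      // 0.75: denominator 4 = 2^2, stored exactly
console.log(0.1 + 0.2);  // 0.30000000000000004: 1/10 and 1/5 have no
                         // finite binary expansion, so both operands
                         // are already approximations before the add
```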

It only makes sense to do calculations when both numbers are in the same "window of precision" or scale. E.g. very small numbers calculated with other very small numbers? Never very large numbers and very small numbers?

When specifically considering addition and subtraction, this is roughly true. In particular, I refer you to Sterbenz's theorem, which says, roughly: if you subtract one number from another and the two are within a factor of 2 of each other, the result is exact. However, that's only a sufficient condition, not a necessary one. More generally, addition is exact in situations where ((a+b) - a) - b, evaluated in a floating-point context without extended precision, produces zero. See Kahan's summation algorithm for more about this sort of thing.
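If you want to test that in JS directly, one option is Knuth's TwoSum, which recovers the exact rounding error of an addition using only double arithmetic (the function name below is mine):

```js
// Knuth's TwoSum: computes the exact error of the rounded sum a + b.
// The addition was exact if and only if that error is zero. Unlike the
// naive ((a+b) - a) - b, this works regardless of which operand is
// larger, and JS guarantees plain binary64 arithmetic throughout.
function additionIsExact(a, b) {
  const s = a + b;
  const bb = s - a;                       // the portion of s contributed by b
  const err = (a - (s - bb)) + (b - bb);  // exact rounding error of a + b
  return err === 0;
}

console.log(additionIsExact(1, 2));      // true:  3 is representable
console.log(additionIsExact(0.1, 0.2));  // false: the sum rounds
```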

For multiplication, it's actually pretty easy to check for error. Right-shift each significand until its LSB is 1, then multiply the two shifted significands; if the result is 2^53 or greater, there's rounding. You can do the same sort of significand-based "trial operation" for addition/subtraction, but first you need to shift the significand of the larger-exponent operand left by the difference in exponents (and then right-shift both significands until one of them has a 1 in its LSB).
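A sketch of that trial multiplication in JS, using BigInt for the 53-bit significands (helper names are mine; zero, infinite, NaN, and denormal inputs are not handled, and exponent overflow/underflow is ignored, as noted below):

```js
// Extract the 53-bit significand of a normal double: 52 stored
// fraction bits plus the implicit leading 1.
function rawSignificand(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  return (bits & 0xfffffffffffffn) | 0x10000000000000n;
}

// Right-shift until the least significant bit is 1.
function stripTrailingZeros(sig) {
  while ((sig & 1n) === 0n) sig >>= 1n;
  return sig;
}

// a * b is exact iff the product of the shifted significands still
// fits in 53 bits; otherwise the low bits are rounded away.
function multiplicationIsExact(a, b) {
  const p = stripTrailingZeros(rawSignificand(a)) *
            stripTrailingZeros(rawSignificand(b));
  return p < 2n ** 53n;
}

console.log(multiplicationIsExact(3, 5));  // true: 15 is exact
console.log(multiplicationIsExact(2 ** 27 + 1, 2 ** 27 + 1));
                                           // false: product needs 55 bits
```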

For division, it's even easier. You've noted that only rational numbers with a power-of-2 denominator are exactly representable. So if you ever divide a nonzero number by a number which is not a power of two (that is, one whose stored mantissa bits are not all zero), there's rounding, unless the right-shifted significands divide exactly as integers.
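The corresponding divisibility test, with the two helpers repeated from the multiplication sketch so this runs standalone (same caveats about zero, denormal, and overflowing inputs):

```js
function rawSignificand(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return (view.getBigUint64(0) & 0xfffffffffffffn) | 0x10000000000000n;
}
function stripTrailingZeros(sig) {
  while ((sig & 1n) === 0n) sig >>= 1n;
  return sig;
}

// a / b is exact iff b's odd significand divides a's odd significand;
// the power-of-2 parts only shift the exponent of the result.
function divisionIsExact(a, b) {
  const ma = stripTrailingZeros(rawSignificand(a));
  const mb = stripTrailingZeros(rawSignificand(b));
  return ma % mb === 0n;
}

console.log(divisionIsExact(6, 3));  // true:  6/3 = 2
console.log(divisionIsExact(5, 4));  // true:  5/4 = 1.25 (divisor is 2^2)
console.log(divisionIsExact(1, 3));  // false: 1/3 has no finite binary form
```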

Note that in all of this I've been ignoring underflow and overflow, and sweeping denormalized inputs under the rug. If you want to do this correctly, you really need to get a good book on floating point calculation. I can recommend Higham's "Accuracy and Stability of Numerical Algorithms".

Upvotes: 3
