If floating point numbers can sometimes be inaccurate, why do they exist? What scenarios would someone want to use floating point numbers?
(First, note that floating-point numbers are exact. It is floating-point arithmetic that approximates real-number arithmetic. This distinction is important for designing good floating-point software and writing proofs about it.)
People use floating-point arithmetic because it is useful for working with numbers of diverse magnitudes. Consider using floating-point to design and construct a building or other structure. When the designer specifies a beam or cable that is 10 meters long, the actual delivered cable will not be exactly 10 meters long. If you measure it and convert the result to a 32-bit float¹, the conversion might introduce an error, but that error will be less than one micrometer. Your measurement of the cable will have far more error than that, so the floating-point error is minuscule and does not matter in this simple measurement.
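A quick way to see this claim, sketched in Python: round-trip a measurement near 10 meters through IEEE-754 binary32 and look at the rounding error. The helper name `to_float32` and the measured value are illustrative, not from the answer.

```python
import struct

def to_float32(x):
    # Round-trip through IEEE-754 binary32 to get the value actually stored
    return struct.unpack('f', struct.pack('f', x))[0]

measured = 10.0000001            # hypothetical measurement near 10 m
stored = to_float32(measured)
error = abs(stored - measured)
print(error)                     # about 1e-7 m, well under one micrometer
```

The spacing between adjacent binary32 values near 10 is 2⁻²⁰ (about 0.95 micrometer), so the worst-case conversion error is half that.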
When many calculations are done, these rounding errors can not only accumulate but combine in surprising ways. If float is not sufficient, we can use double², for which the initial error in a measurement around 10 meters would be under 2 femtometers (1 femtometer = 10⁻¹⁵ meters).
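The femtometer figure can be checked directly: the spacing between adjacent double-precision values near 10 is given by `math.ulp`, and the worst-case rounding error is half that spacing.

```python
import math

# Spacing of IEEE-754 binary64 (double) values near 10 meters
spacing = math.ulp(10.0)          # 2**-49, about 1.78e-15
max_rounding_error = spacing / 2  # about 8.9e-16 m, under 2 femtometers
print(max_rounding_error)
```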
So floating-point has plenty of precision for normal physical uses: Measuring and designing objects, processing audio or radio signals, evaluating hypotheses from physics or chemistry, and so on. When floating-point is used well, the representation errors and rounding errors in floating-point arithmetic simply do not matter. They are too small to notice; they have no observable effects on the work being done.
Issues using floating-point arise when novices, accustomed to the rigidity of most integer arithmetic, are surprised by how floating-point behaves. Although an error might be only one part in 9·10¹⁵, if it means the result is 6.99999999999999911182158029987476766109466552734375 instead of 7 in a number that is then converted to int, they get the wrong result and do not understand how their program went wrong. Mostly this error arises among students and Stack Overflow question writers; it is not a problem in practice when floating-point code is used by people who have learned the basics of floating-point arithmetic.
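A classic instance of this surprise, shown in Python: in real-number arithmetic 0.7 / 0.1 is exactly 7, but the nearest doubles to 0.7 and 0.1 divide to a value just under 7, and converting to int truncates toward zero.

```python
q = 0.7 / 0.1
print(q)         # 6.999999999999999
print(int(q))    # truncation toward zero gives 6, not 7
print(round(q))  # rounding to nearest absorbs the tiny error: 7
```

Rounding rather than truncating is the usual remedy when a result is known to be near an integer.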
Issues also arise because, as mentioned above, errors can combine in surprising ways. Matrix operations, for example, can be “unstable,” meaning they tend to amplify errors. Thus, although a floating-point format may have plenty of precision, results may have great errors (compared to real-number arithmetic) due to mathematical properties of the data and operations.
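A full matrix example is long, but the same amplification appears in miniature in catastrophic cancellation (a different, simpler mechanism than matrix instability): subtracting nearly equal values can turn a tiny absolute error into a large relative error.

```python
# 1e-15 cannot be represented exactly; adding it to 1.0 rounds the sum.
a = 1.0 + 1e-15
b = 1.0
diff = a - b        # the subtraction itself is exact here...
print(diff)         # 1.1102230246251565e-15
# ...but relative to the intended result 1e-15, the error is about 11%.
```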
Nonetheless, floating-point is very useful for some work where it would be a burden to use integer arithmetic. When the numbers are diverse in magnitude, it is hard to write integer arithmetic that handles them. Either the scaling has to be designed in advance (which limits what data a program can work with) or it has to be managed by the program, which is essentially a reinvention of floating-point.
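A minimal sketch of the scaling problem described above, assuming a fixed-point scheme that stores lengths as integer micrometers (the `SCALE` constant is a hypothetical design-time choice, not from the answer):

```python
SCALE = 10**6  # micrometers per meter, fixed when the program is designed

beam = 10 * SCALE          # 10 m -> 10_000_000, fine at this scaling
atom = int(1e-10 * SCALE)  # 0.1 nm underflows to 0 at this scaling
print(beam, atom)          # 10000000 0
```

Handling both magnitudes would require tracking a per-value scale factor, which is essentially reinventing floating-point.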
¹ IEEE-754 basic 32-bit binary floating-point, which has a sign, an eight-bit exponent, and a 24-bit significand.
² IEEE-754 basic 64-bit binary floating-point, which has a sign, an eleven-bit exponent, and a 53-bit significand.