Reputation: 107
I am reading some data from a CSV file, and I have custom code to parse string values into different data types. For numbers, I use:

val format = NumberFormat.getNumberInstance()

which returns a DecimalFormat, and I call the parse function on that to get my numeric value. DecimalFormat has arbitrary precision, so I am not losing any precision there. However, when the data is pushed into a Spark DataFrame, it is stored using DoubleType. At this point, I am expecting to see some precision issues, but I do not. I tried entering values from 0.1, 0.01, 0.001, ..., down to 1e-11 in my CSV file, and when I look at the values stored in the Spark DataFrame, they are all accurately represented (i.e. not something like 0.099999999). I am surprised by this behavior, since I do not expect a double value to store arbitrary precision. Can anyone help me understand the magic here?
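For reference, a stripped-down sketch of what my code does (the column name, sample values, and local Spark session are just placeholders for this question, and I am assuming a locale that uses '.' as the decimal separator):

import java.text.NumberFormat
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-parse").getOrCreate()
import spark.implicits._

// getNumberInstance() hands back a DecimalFormat for the default locale
val format = NumberFormat.getNumberInstance()

// stand-ins for the string fields read from the CSV file
val raw = Seq("0.1", "0.01", "0.001", "0.00000000001")

// parse() returns a java.lang.Number; the resulting DataFrame column is DoubleType
val df = raw.map(s => format.parse(s).doubleValue()).toDF("value")
df.show(truncate = false)   // every row displays with no visible rounding error, e.g. 0.1 rather than 0.099999...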
Cheers!
Upvotes: 2
Views: 16185
Reputation: 1886
There are probably two issues here: the number of significant digits that a Double can represent in its mantissa; and the range of its exponent.
Roughly, a Double has about 15-16 (decimal) digits of precision, and its magnitude can cover the range from about 10^-308 to 10^+308. (The actual limits are set by the binary representation defined by the IEEE 754 format.)
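You can get a rough feel for those limits straight from the standard library; a quick Scala check (the values in the comments are approximate):

println(java.lang.Double.MAX_VALUE)   // about 1.8E308, the largest finite Double
println(java.lang.Double.MIN_VALUE)   // about 4.9E-324, the smallest positive (subnormal) Double
println(math.ulp(1.0))                // about 2.2E-16, the gap to the next Double above 1.0, i.e. roughly 16 decimal digits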
When you try to store a number like 1e-11, it can be approximated very closely within the 53 bits available in the mantissa (52 stored plus one implicit bit). On top of that, when the Double is converted back to text for display, Java prints just enough decimal digits to identify that Double uniquely, so a value like 0.1 or 1e-11 comes back out looking exactly the way it went in; that is why you never see anything like 0.099999999. Where you'll get accuracy issues is when you want to subtract two numbers that are so close together that they only differ by a small number of the least significant bits (assuming that their mantissas have been shifted so that their exponents are aligned).
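To see what is actually stored for one of those values, here is a quick Scala sketch (using java.math.BigDecimal only to expose the exact binary value behind the Double):

// 1e-11 written out the way it might appear in the CSV
val x = "0.00000000001".toDouble

// Converting back to a string gives just enough digits to identify the Double, so it looks exact
println(x)                           // 1.0E-11

// The value held in the 53-bit mantissa is really only a very close approximation
println(new java.math.BigDecimal(x)) // a long decimal expansion close to, but not exactly, 1e-11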
For example, if you try (1e20 + 2) - (1e20 + 1), you'd hope to get 1, but actually you'll get zero. This is because a Double does not have enough precision to represent the roughly 20 (decimal) digits needed to tell 1e20 + 1 apart from 1e20, so both sums round to exactly 1e20. However, (1e100 + 2e90) - (1e100 + 1e90) is computed to be almost exactly 1e90, as it should be, because there the difference is still huge compared with the gap between adjacent Doubles near 1e100.
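Both examples are easy to confirm in a Scala REPL; the comments below show roughly what gets printed:

println((1e20 + 2) - (1e20 + 1))          // 0.0, because both sums round to exactly 1e20
println((1e100 + 2e90) - (1e100 + 1e90))  // a value very close to 1e90, good to a handful of significant digits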
Upvotes: 3