Reputation: 1324
I was doing some homework problems from my textbook and had a few questions on floating point rounding / precision for certain arithmetic operations.
If I have doubles cast from ints like so:
int x = random();
double dx = (double) x;
And let's say the variables y, z, dy, and dz follow the same format.
Then would operations like:
(dx + dy) + dz == dx + (dy + dz)
(dx * dy) * dz == dx * (dy * dz)
be associative? I know that with fractional values they would not be, because precision is lost to rounding in a way that depends on which operands are added or multiplied first. However, since these values are cast from ints, I feel that precision would not be a problem and that these operations could be associative.
Lastly, the textbook I'm using does not explain FP division at all, so I was wondering whether this statement is true, or at least how floating point division works in general:
dx / dx == dz / dz
I looked this up online and read in some places that an operation like 3/3 can yield .999...9, but there wasn't enough information to explain how that happens or whether it varies with other division operations.
Upvotes: 2
Views: 984
Reputation: 19037
You should understand that floating point numbers are typically internally represented as a sign bit, a fixed point mantissa (of 52 bits with an implied leading one for IEEE 64-bit doubles), and a binary exponent (11 bits for IEEE doubles). You can think of the exponent as the "quantum" of math units for a given value.
The addition should be associative if the sums all fit into the mantissa without the quantum set by the exponent going above 2^0 == 1. If random() is producing 32-bit integers, a sum such as (dx + dy) + dz will fit, and the addition will be associative.
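As a minimal sketch of the addition case, assuming a 32-bit int and IEEE-754 doubles: every intermediate sum of three such values is an integer of magnitude below 2^34, so neither grouping rounds. The helper name add_is_associative is made up for illustration and is not from the question.

```c
/* Hypothetical helper: returns 1 when both groupings of the sum
   compare equal after converting the ints to double.
   Assumes int is 32 bits wide and double is IEEE-754 binary64. */
int add_is_associative(int x, int y, int z)
{
    double dx = (double)x;
    double dy = (double)y;
    double dz = (double)z;

    /* Each partial sum is an integer far below 2^53, so it is exact. */
    double left  = (dx + dy) + dz;
    double right = dx + (dy + dz);

    return left == right;
}
```

Calling it with extreme inputs such as the largest int twice and the smallest int once should still return 1, since the exact sums stay well inside the range where doubles represent every integer.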
In the case of multiplication, the product of two 32-bit numbers can need up to 64 bits, well over the 53 bits available, so the quantum may need to go above 1 for the mantissa to hold the magnitude of the result. The low-order bits are then rounded away, and associativity can fail.
For division, in the particular case of dx / dx, the compiler may replace the expression with a constant 1.0 (perhaps after a zero check).
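A small sketch of the division statement from the question, assuming IEEE-754 semantics (division is correctly rounded, and 0.0 / 0.0 produces NaN). For any nonzero value the exact quotient of a number by itself is 1, which is representable, so the result is exactly 1.0 and not .999...9. The zero case is the reason a zero check matters: NaN compares unequal to everything, including itself. The helper name self_quotients_equal is hypothetical.

```c
/* Hypothetical helper: evaluates dx / dx == dz / dz for doubles
   converted from ints. Assumes IEEE-754 arithmetic, where
   0.0 / 0.0 yields NaN instead of trapping. */
int self_quotients_equal(int x, int z)
{
    double dx = (double)x;
    double dz = (double)z;

    /* Nonzero operand: the quotient is exactly 1.0.
       Zero operand: the quotient is NaN, and NaN == anything is false. */
    return dx / dx == dz / dz;
}
```

Under these assumptions the comparison holds for every pair of nonzero inputs and fails as soon as either input is 0.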
Upvotes: 1
Reputation: 122373
Assuming int is at most 32 bits wide and double follows IEEE-754, a double can store every integer of magnitude up to 2^53 exactly.
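To illustrate the 2^53 limit, here is a rough sketch that checks whether an integer survives a round trip through double. It assumes IEEE-754 doubles and a 64-bit long long; survives_double is a made-up name for this example.

```c
/* Hypothetical helper: returns 1 if n converts to double and back
   without changing. Intended for values far below the long long
   limits, so the conversion back to long long is well defined. */
int survives_double(long long n)
{
    double d = (double)n;      /* rounds to the nearest double */
    return (long long)d == n;  /* exact only if no rounding occurred */
}
```

Every 32-bit int passes this check, and so does 2^53 = 9007199254740992. The next integer, 9007199254740993, does not: it lies halfway between two doubles and is rounded to 2^53.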
In the case of addition:
(dx + dy) + dz == dx + (dy + dz)
Both sides of == will have their precise values, so it is associative.
While in the case of multiplication:
(dx * dy) * dz == dx * (dy * dz)
It's possible that the value is over 2^53, so the two sides are not guaranteed to be equal.
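A concrete counterexample may help here. The sketch below assumes IEEE-754 doubles with round-to-nearest-even and arithmetic evaluated in double precision (typical on x86-64 and ARM64; x87 extended precision or fast-math options can change the outcome). The helper name mul_is_associative and the chosen inputs are illustrative only.

```c
/* Hypothetical helper: returns 1 when both groupings of the product
   compare equal after converting the ints to double. */
int mul_is_associative(int x, int y, int z)
{
    double dx = (double)x;
    double dy = (double)y;
    double dz = (double)z;

    double left  = (dx * dy) * dz;  /* rounds after x*y, then after *z */
    double right = dx * (dy * dz);  /* rounds after y*z, then after x* */

    return left == right;
}
```

Take x = 3, y = 2^26 + 1 = 67108865 and z = 2^27 + 1 = 134217729. The exact product is 3*2^53 + 9*2^26 + 3. On the left, x*y is exact and the final product is rounded once, up to 3*2^53 + 9*2^26 + 4. On the right, y*z = 2^53 + 3*2^26 + 1 is a tie that rounds down to 2^53 + 3*2^26, and multiplying by 3 then gives exactly 3*2^53 + 9*2^26. The two results differ by 4, so the comparison is false for these inputs, while small inputs such as 3, 5, 7 still compare equal.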
Upvotes: 1