Is this expected behavior for float fused-multiply-add?

Question

I have three numbers with precise representation using (32-bit) floats:

x = 16277216, y = 16077216, z = -261692320000000

I expect a fused multiply-add x*y+z to compute the mathematically exact value and round it once. The exact value is -2489344, which needs no rounding, so this should be the output of the fused multiply-add. But when I compute fma(x,y,z), the result is -6280192 instead. Why?

I'm using Rust. Note that z is the rounded result of -x*y.

let x: f32 = 16277216.0;
let y: f32 = 16077216.0;
let z = - x * y;
assert_eq!(z, -261692320000000.0 as f32); // pass
let result = x.mul_add(y, z);
assert_eq!(result, -2489344.0 as f32); // fail

println!("x: {:>32b}, {}", x.to_bits(), x);
println!("y: {:>32b}, {}", y.to_bits(), y);
println!("z: {:>32b}, {}", z.to_bits(), z);
println!("result: {:>32b}, {}", result.to_bits(), result);

The output is

x:  1001011011110000101111011100000, 16277216
y:  1001011011101010101000110100000, 16077216
z: 11010111011011100000000111111110, -261692320000000
result: 11001010101111111010100000000000, -6280192

Eric Postpischil · Accepted Answer

I have three numbers with precise representation using (32-bit) floats:

x = 16277216, y = 16077216, z = -261692320000000

This premise is false. −261,692,320,000,000 cannot be represented exactly in any 32-bit floating-point format because its significand requires 37 bits to represent.

The IEEE-754 binary32 format commonly used for float has 24-bit significands. Scaling the significand of −261,692,320,000,000 to be under 2²⁴ in magnitude yields −261,692,320,000,000 = −15,598,077.7740478515625•2²⁴. As we can see, the significand is not an integer at this scale, so it cannot be represented exactly, and I would not call it precise either. The closest representable value is −15,598,078•2²⁴ = −261,692,323,790,848.
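The scaling above can be checked by pulling the fields out of the float's bit pattern. This is a minimal sketch: `decompose` is a hypothetical helper (not part of the standard library) and it assumes a normal, finite, non-zero input.

```rust
/// Hypothetical helper: split a normal f32 into
/// (is_negative, 24-bit integer significand, power-of-two exponent),
/// so that |v| = significand * 2^exponent.
/// Assumes v is finite, non-zero, and not subnormal.
fn decompose(v: f32) -> (bool, u32, i32) {
    let bits = v.to_bits();
    let negative = (bits >> 31) == 1;
    let biased_exponent = ((bits >> 23) & 0xff) as i32;
    let fraction = bits & 0x007f_ffff;
    // Restore the implicit leading 1 to get the full 24-bit significand.
    let significand = fraction | 0x0080_0000;
    // Remove the bias (127) and scale so the significand is an integer (23).
    let exponent = biased_exponent - 127 - 23;
    (negative, significand, exponent)
}

fn main() {
    let z: f32 = -261692320000000.0;
    let (negative, significand, exponent) = decompose(z);
    // prints: negative = true, significand = 15598078, exponent = 24
    println!(
        "negative = {}, significand = {}, exponent = {}",
        negative, significand, exponent
    );
    // The exponent is positive here, so the magnitude is an integer.
    let magnitude = (significand as i64) << exponent;
    // prints: exact magnitude = 261692323790848
    println!("exact magnitude = {}", magnitude);
}
```

The significand 15,598,078 and exponent 24 agree with the bit pattern printed in the question.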

println!("z: {:>32b}, {}", z.to_bits(), z);
…
z: 11010111011011100000000111111110, -261692320000000

Rust is lying; the value of z is not -261692320000000. Its default {} formatting prints only as many significant digits as are needed to identify the f32 uniquely (eight here) and fills the remaining places with zeros. The actual value of z is −261,692,323,790,848.
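One way to see the stored value is to convert it to an integer before printing. The sketch below assumes the float holds an integer that fits in i64, which is the case here; `display_and_exact` is a hypothetical helper name.

```rust
/// Hypothetical helper: return the default `{}` rendering of an f32
/// next to its exact value.
/// Assumes v is integer-valued and within i64 range, so the cast is exact.
fn display_and_exact(v: f32) -> (String, i64) {
    (format!("{}", v), v as i64)
}

fn main() {
    let z: f32 = -261692320000000.0;
    let (shown, exact) = display_and_exact(z);
    println!("shown as f32: {}", shown); // -261692320000000
    println!("exact value:  {}", exact); // -261692323790848
}
```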

The value of 16,277,216•16,077,216 − 261,692,323,790,848 using ordinary real-number arithmetic is −6,280,192, so that result for the FMA is correct.
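This arithmetic can be reproduced with integers, since all three operands are integers that fit comfortably in i64. A minimal sketch, with `exact_mul_add` as a hypothetical helper that is only valid for integer-valued inputs whose product and sum stay within i64 range:

```rust
/// Hypothetical helper: evaluate x*y + z exactly using 64-bit integers.
/// Assumes x, y and z are integer-valued and that nothing overflows i64.
fn exact_mul_add(x: f32, y: f32, z: f32) -> i64 {
    (x as i64) * (y as i64) + (z as i64)
}

fn main() {
    let x: f32 = 16277216.0;
    let y: f32 = 16077216.0;
    let z: f32 = -(x * y); // the product is rounded to f32 here

    let exact = exact_mul_add(x, y, z);
    let fused = x.mul_add(y, z);

    println!("exact: {}", exact); // -6280192
    println!("fused: {}", fused); // -6280192
    // The exact result fits in 24 bits, so the FMA returns it unrounded.
    assert_eq!(fused as i64, exact);
}
```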

The rounding error occurred in the statement let z = - x * y; where multiplying 16,277,216 by 16,077,216 rounded the exact product, 261,692,317,510,656, to the nearest value representable in binary32, 261,692,323,790,848.
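The rounding in the multiplication can be made visible by repeating it in f64. The product of two 24-bit significands needs at most 48 bits, so it fits exactly in f64's 53-bit significand. This is a sketch; `product_rounding` is a hypothetical helper name.

```rust
/// Hypothetical helper: return (exact product, f32-rounded product, difference)
/// for two f32 values, all expressed as f64.
/// The f64 product of two f32 values is exact (48 bits fit in 53).
fn product_rounding(x: f32, y: f32) -> (f64, f64, f64) {
    let exact = (x as f64) * (y as f64);
    let rounded = (x * y) as f64; // multiply in f32, then widen exactly
    (exact, rounded, exact - rounded)
}

fn main() {
    let (exact, rounded, difference) = product_rounding(16277216.0, 16077216.0);
    println!("exact product:   {}", exact);      // 261692317510656
    println!("rounded product: {}", rounded);    // 261692323790848
    println!("difference:      {}", difference); // -6280192
}
```

The difference, −6,280,192, is the same value the FMA returned: with z set to the negated rounded product, x*y+z is exactly the rounding error of the f32 multiplication.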
