c floating-point overflow precision ieee-754

Reputation: 162

IEEE-754: "smallest" overflow condition

Before I start, just some background information:

I'm running a bare-metal application on an ARM7 microcontroller (LPC2294/01) compiled in Keil uVision3, using the compiler standard math library (which is IEEE-754 compliant).

The issue: I'm having trouble wrapping my head around what exactly constitutes an 'overflow' on the sum of 2 single-precision floating point inputs.

Initially, I was under the impression that if I attempted to add any positive value to the largest value that can be represented by IEEE-754 notation, the result would generate an overflow exception.

So for instance, suppose I have:

 a = 0x7f7fffff (ie. 3.4028235..E38);
 b = 0x3f800000 (ie. 1.0)

I expected that summing these two values would result in overflow as defined in IEEE-754. To my initial surprise, the result simply returned the value of 'a' with no exception being flagged.

So then I thought, since the precision (or resolution if you prefer) decreases as the value being represented increases, it's likely the value '1' in this case is being effectively rounded down to 0 due to its relative insignificance.

So that raised the question: what would be the smallest value of 'b' in this case that would cause an overflow exception? Does it depend on the specific implementation of IEEE-754?

Maybe it's as simple as me not understanding how to determine the minimum 'significant' precision in this particular case, but given the code below, why would the second sum cause an overflow and not the first?

static union sFloatConversion32
{
     unsigned int unsigned32Value;
     float floatValue;
} sFloatConversion32;

t_bool test_Float32_Addition(void)
{
   float a;
   float b;
   float c;

   sFloatConversion32.unsigned32Value = 0x7f7fffff;
   a = sFloatConversion32.floatValue;

   sFloatConversion32.unsigned32Value = 0x72ffffff;
   b = sFloatConversion32.floatValue;

   /* This sum returns (c = a) without overflow */
   c = a + b;

   sFloatConversion32.unsigned32Value = 0x73000000;
   b = sFloatConversion32.floatValue;

   /* This sum, however, causes an overflow exception */
   c = a + b;
}

Is there a generalized rule that can be applied such that it would be possible to know ahead of time (ie. without performing the sum), that given two floats, their sum will cause an overflow as defined by IEEE-754?

Upvotes: 2

Answers (3)

Eric Postpischil

Reputation: 222900

Overflow occurs when the result is affected by the range of the format. As long as normal rounding keeps the result within the finite range, no overflow occurs, because the result is the same as it would be if the exponent were unbounded—the result was reduced by the normal rounding, before range was considered. So there is no exception due to range.

When the rounded result does not fit into the finite range of the format, then a finite result cannot be produced, so an overflow exception occurs and infinity is produced.

In IEEE 754, a normal operation is in effect two steps:

1. Calculate the exact mathematical result.
2. Round the exact mathematical result to the nearest representable value.

IEEE 754 defines overflow to occur if and only if the result of the above, rounded as though the exponent range were unbounded, exceeds in magnitude the largest representable finite value. In other words, overflow does not occur just because you went above the largest representable value but only if you go so far above the largest representable value that the normal way arithmetic works in floating-point does not work.

So, if you start with the largest representable value and add a small number to it, the result would simply round to the largest representable value anyway (when using round-to-nearest). IEEE 754 regards this as normal—all arithmetic operations round, and if that rounding kept the result in bounds, that is normal and unexceptionable. Even if the exponent range were unbounded, normal rounding would have produced the same result. Since this is a normal result not affected by the limited range, nothing exceptional has occurred.

Overflow occurs only when the mathematical result is so large that rounding would produce the next higher number if we were not limited by the exponent. (But, since we have reached the limits of the exponent range, we must return infinity.)

The largest representable value in IEEE-754 basic 32-bit binary floating-point is 2¹²⁸−2¹⁰⁴. At this point, the steps between representable numbers are in units of 2¹⁰⁴. With the round-to-nearest rule, adding any number less than half a step, 2¹⁰³, to this will round to 2¹²⁸−2¹⁰⁴, and no overflow occurs. If you add a number greater than 2¹⁰³, then the result would round to 2¹²⁸ if the exponent could go that high. Instead, infinity is produced and an overflow exception occurs. (If you add exactly 2¹⁰³, the rule for ties is used. This rule says to choose the candidate with the even low bit. That produces 2¹²⁸, so it also overflows.)
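The threshold above can be turned into a check that avoids doing the float addition itself. This is only a sketch under stated assumptions: `float_add_overflows` is a hypothetical helper (not part of any library), it assumes round-to-nearest, ties-to-even, finite inputs, `float` as binary32 and `double` as binary64. It compares the sum, computed in `double`, against the midpoint 2¹²⁸−2¹⁰³.

```c
#include <float.h>

/* The sketch only makes sense for binary32 float and binary64 double. */
_Static_assert(FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128, "binary32 float assumed");
_Static_assert(DBL_MANT_DIG == 53, "binary64 double assumed");

/* Hypothetical helper, not a library function.
   Returns 1 if a + b would overflow in binary32 under round-to-nearest,
   ties-to-even, and 0 otherwise. Assumes a and b are finite. */
static int float_add_overflows(float a, float b)
{
    /* Midpoint between FLT_MAX and 2^128, i.e. 2^128 - 2^103.
       Representable in double, but not in float. */
    const double threshold = 0x1.ffffffp+127;

    /* The double sum may itself round, but not across the threshold:
       a sum that lands close to it needs both operands to be at least
       2^102 in magnitude, and such operands add exactly in double. */
    double s = (double)a + (double)b;
    if (s < 0.0)
        s = -s;

    /* A tie at the midpoint rounds to 2^128 (even low bit), so ">=" */
    return s >= threshold;
}
```

With the values from the question, `float_add_overflows(FLT_MAX, 0x1.fffffep+102f)` (bit pattern 0x72ffffff) should report no overflow, while `float_add_overflows(FLT_MAX, 0x1p+103f)` (bit pattern 0x73000000) should report overflow.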

So, with round-to-nearest, overflow occurs at the midpoint of a step. With other rounding rules, overflow occurs at different points. With round-toward-infinity (round up), adding any positive value, even 2⁻¹⁴⁹, to 2¹²⁸−2¹⁰⁴ will cause an overflow. With round-toward-zero, adding any value less than 2¹⁰⁴ to 2¹²⁸−2¹⁰⁴ will not overflow.

Upvotes: 1

chux

Reputation: 153517

Does it depend on the specific implementation of IEEE-754?

Yes, and it also depends on the rounding mode active at the time.

Consider the step between FLT_MAX and the float just before it.

// Needs <float.h>, <math.h> and <stdio.h>
float max = FLT_MAX;
float before_max = nextafterf(max, 0.0f);
float delta = max - before_max;
printf("max:   % -20a %.*g\n", max, FLT_DECIMAL_DIG, max);
printf("b4max: % -20a %.*g\n", before_max, FLT_DECIMAL_DIG, before_max);
printf("1st d: % -20a %.*g\n", delta, FLT_DECIMAL_DIG, delta);
// Typical output
// max:    0x1.fffffep+127     3.40282347e+38
// b4max:  0x1.fffffcp+127     3.40282326e+38
// 1st d:  0x1p+104            2.02824096e+31

FLT_MAX is just under twice the smallest float that has the same step size (ULP). Think of that smaller float as having all its explicit precision bits cleared, whereas FLT_MAX has them all set.

float m0 = nextafterf(max/2, max);
printf("m0:    %- 20a %.*g\n", m0, FLT_DECIMAL_DIG, m0);
// m0:     0x1p+127            1.70141183e+38

Now compare this to FLT_EPSILON, the smallest step from 1.0 to the next larger float:

float eps = FLT_EPSILON;
printf("epsil: %- 20a %.*g\n", eps, FLT_DECIMAL_DIG, eps);
// Output
// epsil:  0x1p-23             1.1920929e-07

Notice the ratio delta/m0 is FLT_EPSILON.

float r = delta/m0;
printf("r:     %- 20a %.*g\n", r, FLT_DECIMAL_DIG, r);
// r:      0x1p-23             1.1920929e-07

Consider the typical rounding mode of rounding to nearest, ties to even.
Now let us try adding delta/2 to FLT_MAX, and then try adding the next smaller float instead.

float sum = max + delta/2;
printf("sum:        % -20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
sum = max + nextafterf(delta/2, 0.0f);
printf("sum:        % -20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
// sum:         inf                 inf
// sum:         0x1.fffffep+127     3.40282347e+38

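We can see that the smallest delta that overflows is about FLT_MAX*1/2*1/2*FLT_EPSILON: the product below is the largest addend that does not overflow, and the next float above it is the smallest that does.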

float small = FLT_MAX*0.25f*FLT_EPSILON;
printf("small: %- 20a %.*g\n", small, FLT_DECIMAL_DIG, small);
printf("sum:        % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
small = nextafterf(small, max);
printf("sum:        % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
// sum:         0x1.fffffep+127     3.40282347e+38
// sum:         inf                 inf

Given the various possible encodings for float, your results may differ, yet this approach gives an idea of how to determine the smallest delta that causes overflow.

Upvotes: 1

0___________

Reputation: 67546

Run this program long enough and see what will happen:

float x = 10000000.0f;
while(1)
{
    printf("%f\n", x);
    x += 1.0f;
}

I think it will answer your question.
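A bounded variant of that loop, as a sketch: `stall_point` is a hypothetical helper that stops as soon as adding 1.0f no longer changes the value, instead of printing forever. It assumes binary32 `float`, round-to-nearest, and a start value between 1 and 2²⁴. Note that this shows a small addend being rounded away, not overflow: the value stops growing long before it gets anywhere near infinity.

```c
/* Hypothetical helper: keeps adding 1.0f to `start` until the sum stops
   changing, then returns the value at which it stalled.
   Assumes 1 <= start <= 2^24, so the loop ends within about
   16.8 million iterations. */
static float stall_point(float start)
{
    /* volatile: make sure every intermediate value is rounded to float,
       even on targets that evaluate in a wider format */
    volatile float x = start;
    volatile float next;

    for (;;) {
        next = x + 1.0f;
        if (next == x)
            return next;        /* adding 1.0f no longer has any effect */
        x = next;
    }
}
```

Starting from 10000000.0f as in the loop above, this should stall at 16777216.0f (2²⁴), where the step between adjacent floats becomes 2 and the tie at 16777217 rounds back down.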

Upvotes: -1
