Reputation: 154230
In experimenting with rounding modes, FE_DOWNWARD
( and FE_TOWARDZERO
) apparently fails to form the expected sum of infinity and instead forms DBL_MAX
when adding DBL_MAX
and 1 ULP of DBL_MAX
.
Follows is test code that demos the unexpected sum. Under different rounds modes, it adds to DBL_MAX
values near 0.5 ULP and 1.0 ULP. No problems noted in FE_TONEAREST
and FE_UPWARD
.
Questions:
DBL_MAX
problem recently reported, so perhaps my math library is sub-par. Advice on how to report is requested.Compiler notes:
Invoking: Cygwin C Compiler
gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF"rand_i.d" -MT"rand_i.o" -o "rand_i.o" "../rand_i.c"
COLLECT_GCC=gcc
Target: x86_64-pc-cygwin
gcc version 11.3.0 (GCC)
GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG + 2)/4;
int P = (LDBL_MANT_DIG + 2)/4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf("mode: Addendum SUM (double) SUM (long double)\n");
for (int i = 0; i < n; i++) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
"ulp-", "ulp", "ulp+"};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d++) {
double sum = max + delta[d];
printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 0.0L + max + delta[d]);
}
puts("");
}
}
Output: note 4 unexpected lines.
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX :0x1.fffffffffffffp+1023 1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971 1.9958403095347198e+292
mode: Addendum SUM (double) SUM (long double)
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_DOWNWARD ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_DOWNWARD ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_TOWARDZERO ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TOWARDZERO ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff802p+1023
mode:FE_UPWARD ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000002p+1024
Upvotes: 6
Views: 169
Reputation: 154230
@Sneftel idea to use a 2 stage rounding model was useful and well explains why my expected result was amiss and the code was right.
Below, for reference, is augmented code that posts the overflow flag and helps illustrates the models application.
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
const char* exceptstr(fexcept_t *flag) {
static char buf[100];
buf[0] = 0;
if (*flag & FE_DIVBYZERO)
strcat(buf, STRINGIFY(FE_DIVBYZERO));
if (*flag & FE_INEXACT)
strcat(buf, STRINGIFY(FE_INEXACT));
if (*flag & FE_INVALID)
strcat(buf, STRINGIFY(FE_INVALID));
if (*flag & FE_OVERFLOW)
strcat(buf, STRINGIFY(FE_OVERFLOW));
if (*flag & FE_UNDERFLOW)
strcat(buf, STRINGIFY(FE_UNDERFLOW));
return buf;
}
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG + 2) / 4;
int P = (LDBL_MANT_DIG + 2) / 4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG,
LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf(
"mode: Addendum SUM (double) SUM (long double) FE\n");
for (int i = 0; i < n; i++) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = {"0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
"ulp-", "ulp", "ulp+"};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d++) {
if (feclearexcept(FE_ALL_EXCEPT)) {
perror("feclearexcept()");
return -1;
}
///////////////////////////////
double sum = max + delta[d];
///////////////////////////////
fexcept_t flag;
if (fegetexceptflag(&flag, FE_ALL_EXCEPT)) {
perror("fegetexceptflag()");
return -1;
}
printf(
"mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La %s\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 1.0L * max + delta[d],
exceptstr(&flag));
}
puts("");
}
}
Output (Look to the far right)
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX :0x1.fffffffffffffp+1023 1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971 1.9958403095347198e+292
mode: Addendum SUM (double) SUM (long double) FE
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_DOWNWARD ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_DOWNWARD ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_TOWARDZERO ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff802p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000002p+1024 FE_INEXACTFE_OVERFLOW
Upvotes: 0
Reputation: 41503
Checking for and dealing with overflow happens in two stages. Note that the actual software/circuitry may use a different strategy, but the result will be as if this were the procedure.
FE_NEAREST
(the "normal" mode) the number is corrected to infinity. For FE_TOWARDZERO
, however, it's corrected to +/-DBL_MAX
. (For the other rounding modes it depends on sign: rounding away from zero leads to infinity and rounding toward zero leads to +/-DBL_MAX
.)The overflow rules for a given mode are thus reminiscent of the rounding rules for that mode, but not really the same. Arguably FE_NEAREST
is the weird one, since it acts like the (IEEE-754-only) ties-away-from-zero rounding mode under all out-of-range situations, rather than noticing that infinity is way farther away than DBL_MAX
. But the basic behavior for the other modes is to output +/-DBL_MAX
when its preferred direction (given the sign) is toward zero, and infinity when its preferred direction is away from zero.
Also note that when the result of stage 1 is a value outside the representable range, an overflow exception will still be emitted even when the result of stage 2 is DBL_MAX
. The overflow doesn't indicate "I made an infinity", but "I had to do stage 2".
Upvotes: 2