Reputation: 23
I have two floating point (double) values a and b, and I'm looking to add them to get a result c.
I know that c will somehow be approximated, because everything is finite precision. Now, I want to 'round down' c, meaning that the floating point c is not greater than the real sum of floating points a and b, or c <= a + b.
How can I do this? The following code in c comes to mind, but I'm not sure whether the answer will be what I want.
c = nextafter(a + b, bigNegativeNumber)
Same question goes for multiplication instead of addition. :)
PS. If it helps, a and b are always non-negative numbers.
Edit: c should be a floating point as well
Upvotes: 2
Views: 267
Reputation: 153468
A tricky problem.
@EOF's comment above to "round toward 0" is good and will provide an optimal result.
#ifdef _ _STDC_IEC_559_ _
fesetround(FE_DOWNWARD);
c = a + b;
#else
#error unable to set rounding mode
#endif
OP's original approach is close too. Any good compilation/processor should be expected to create the best answer to with in 0.5 or 1.0 ULP (depending on rounding mode). It will certainly create a sum c2
less than arithmetic a+b
, but c
may meet the requirements too.
c = a + b
c2 = nextafter(c, -DBL_MAX);
c = floor(a + b)
will not work as a
could be far larger in magnitude than some small negative b
such that the computed sum is simple still a
and fails the arithmetic c <= a + b
.
Upvotes: 0
Reputation: 26085
Based on your description, it seems like you want to control the rounding mode for a floating-point operation. This is supported in C99 by the functionality provided in the header file fenv.h
. You may need to instruct your compiler to turn on C99 support, and you may need to instruct it to perform floating-point arithmetic in an IEEE-754 compliant way. Below is a minimal example showing how to perform double
addition with truncation (rounding towards zero). Since your operands are known to be positive, this is equivalent to rounding down (towards negative infinity).
#include <stdio.h>
#include <stdlib.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
double dadd_rz (double a, double b)
{
double res;
int orig_mode = fegetround ();
fesetround (FE_TOWARDZERO); // set rounding mode to truncate
res = a + b;
fesetround (orig_mode); // restore rounding mode
return res;
}
int main (void)
{
double a = 0x1.fffffffffffffp1023;
printf (" a = %20.13a\n", a);
printf (" a+a = %20.13a\n", a + a);
printf ("round_to_zero (a+a) = %20.13a", dadd_rz (a, a));
return EXIT_SUCCESS;
}
The output of the above program should look something like this (note that the printing of infinities is implementation dependent):
a = 0x1.fffffffffffffp+1023
a+a = 0x1.#INF000000000p+0
round_to_zero (a+a) = 0x1.fffffffffffffp+1023
Upvotes: 1