drasko
drasko

Reputation: 23

Round Down a Floating Point Result

I have two floating point (double) values a and b, and I'm looking to add them to get a result c.

I know that c will somehow be approximated, because everything is finite precision. Now, I want to 'round down' c, meaning that the floating point c is not greater than the real sum of floating points a and b, or c <= a + b.

How can I do this? The following code in c comes to mind, but I'm not sure whether the answer will be what I want.

c = nextafter(a + b, bigNegativeNumber)

Same question goes for multiplication instead of addition. :)

PS. If it helps, a and b are always non-negative numbers.

Edit: c should be a floating point as well

Upvotes: 2

Views: 267

Answers (2)

chux
chux

Reputation: 153468

A tricky problem.

@EOF's comment above to "round toward 0" is good and will provide an optimal result.

#ifdef _ _STDC_IEC_559_ _ 
    fesetround(FE_DOWNWARD);
    c = a + b;
#else
   #error unable to set rounding mode
#endif

OP's original approach is close too. Any good compilation/processor should be expected to create the best answer to with in 0.5 or 1.0 ULP (depending on rounding mode). It will certainly create a sum c2 less than arithmetic a+b, but c may meet the requirements too.

c = a + b
c2 = nextafter(c, -DBL_MAX);

c = floor(a + b) will not work as a could be far larger in magnitude than some small negative b such that the computed sum is simple still a and fails the arithmetic c <= a + b.

Upvotes: 0

njuffa
njuffa

Reputation: 26085

Based on your description, it seems like you want to control the rounding mode for a floating-point operation. This is supported in C99 by the functionality provided in the header file fenv.h. You may need to instruct your compiler to turn on C99 support, and you may need to instruct it to perform floating-point arithmetic in an IEEE-754 compliant way. Below is a minimal example showing how to perform double addition with truncation (rounding towards zero). Since your operands are known to be positive, this is equivalent to rounding down (towards negative infinity).

#include <stdio.h>
#include <stdlib.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

double dadd_rz (double a, double b) 
{
    double res;
    int orig_mode = fegetround ();
    fesetround (FE_TOWARDZERO);  // set rounding mode to truncate
    res = a + b;
    fesetround (orig_mode);      // restore rounding mode
    return res;
}

int main (void)
{
    double a = 0x1.fffffffffffffp1023;
    printf ("                  a = %20.13a\n", a);
    printf ("                a+a = %20.13a\n", a + a);
    printf ("round_to_zero (a+a) = %20.13a", dadd_rz (a, a));
    return EXIT_SUCCESS;
}

The output of the above program should look something like this (note that the printing of infinities is implementation dependent):

                  a = 0x1.fffffffffffffp+1023
                a+a = 0x1.#INF000000000p+0
round_to_zero (a+a) = 0x1.fffffffffffffp+1023

Upvotes: 1

Related Questions