PGupta

Reputation: 215

Float operations using double

I have a function that takes two floating-point numbers as strings, an operation, and the floating-point bit width:

EvaluateFloat(const string &str1, const string &str2, enum operation /* add, subtract, multiply, divide */, unsigned int bit_width, string &output)

The inputs str1 and str2 can be either float (32-bit) or double (64-bit) values.

Is it fine if I store the inputs in double and perform the operation in double regardless of the bit width, then cast the result to float when the bit width is 32? For example:

double num1 = atof(str1.c_str());
double num2 = atof(str2.c_str());
double result = num1 operation num2; //! operation will be resolved using a switch
if (32 == bit_width)
{
    float f_result = static_cast<float>(result);
    output = std::to_string(f_result);
}
else
{
    output = std::to_string(result);
}

Can I safely assume that f_result will be exactly the same as if I had performed the operation in float, i.e.:

float f_num1 = num1;
float f_num2 = num2;
float f_result = f_num1 operation f_num2;

PS:

  1. We assume there won't be any cascaded operations, i.e. out = a + b + c will instead be transformed to: temp = a + b; out = temp + c
  2. I'm not concerned about inf and NaN values.
  3. I'm trying to avoid code redundancy; otherwise I have to implement the same operation twice, once for float and once for double.

Upvotes: 0

Views: 1039

Answers (2)

gmatht

Reputation: 835

According to the standard, a floating-point operation on doubles is equivalent to performing the operation in infinite precision and then rounding the result to double. If we then convert it to float, we have rounded twice. In general this is not equivalent to rounding directly to float. For example, 0.47 rounds to 0.5, which rounds to 1, but 0.47 rounded directly gives 0. However, as mentioned by chtz, the product of two floats is always exactly representable as a double (using IEEE arithmetic, where double has more than twice the precision of float), so when we cast to float we have lost precision only once and the result should be the same. Likewise, addition and subtraction should not be a problem.
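The decimal example can be reproduced with a small sketch. The helper names round_once and round_twice are hypothetical and only illustrate double rounding in decimal; they are not the float/double case itself.

```cpp
#include <cassert>
#include <cmath>

// Rounds directly to the nearest integer.
double round_once(double x) {
    return std::round(x);
}

// Rounds to one decimal place first, then to the nearest integer.
double round_twice(double x) {
    const double one_decimal = std::round(x * 10.0) / 10.0;
    return std::round(one_decimal);
}

// Checks the 0.47 example from the text.
void check_double_rounding_example() {
    assert(round_once(0.47) == 0.0);   // 0.47 -> 0
    assert(round_twice(0.47) == 1.0);  // 0.47 -> 0.5 -> 1
}
```

The two paths disagree because the intermediate rounding lands exactly on a tie (0.5), which is the situation the float/double question is asking about.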

A quotient generally cannot be represented exactly in a double (not even 1/3), so we might expect a problem with division. However, I ran the sample code below overnight, trying over 3 trillion cases, and did not find any case where performing the division in double gives a different answer.

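#include <cstdlib>
#include <cstring>
#include <iostream>

int main() {
        long long i = 0;
        while (1) {
                float x = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
                float y = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
                float  f = x / y;
                double d = (double)x / (double)y;
                // Print progress every 10 million iterations.
                if(++i % 10000000 == 0) { std::cout << i << "\t" << x << "," << y << std::endl; }
                // Stop at the first case where double division rounded to float differs from float division.
                if ((float(d) !=  f)) {
                        // Copy the bits out with memcpy instead of casting pointers (assumes 32-bit unsigned int).
                        unsigned int xbits, ybits;
                        std::memcpy(&xbits, &x, sizeof xbits);
                        std::memcpy(&ybits, &y, sizeof ybits);
                        std::cout << std::endl;
                        std::cout << x << "," << y << std::endl;
                        std::cout << std::hex << xbits << "," << std::hex << ybits << std::endl;
                        std::cout << float(d) - f << std::endl;
                        return 1;
                }
        }
}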
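The loop above only samples quotients of values in [0, 1] at the granularity of RAND_MAX. A variant that draws random bit patterns covers the whole float range, including subnormals. This is only a sketch: float_from_bits and count_division_mismatches are hypothetical names, and it assumes a 32-bit IEEE-754 float and no excess precision in float expressions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <random>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Reinterprets 32 bits as a float (memcpy avoids pointer-aliasing problems).
float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Counts pairs where float division differs from double division rounded to
// float. Non-finite inputs, zero divisors, and overflowing quotients are
// skipped, since inf and NaN are out of scope for the question.
long count_division_mismatches(long trials, std::uint32_t seed) {
    assert(trials >= 0);
    std::mt19937 gen(seed);
    long mismatches = 0;
    for (long i = 0; i < trials; ++i) {
        const float x = float_from_bits(static_cast<std::uint32_t>(gen()));
        const float y = float_from_bits(static_cast<std::uint32_t>(gen()));
        if (!std::isfinite(x) || !std::isfinite(y) || y == 0.0f) continue;
        const float f = x / y;
        if (!std::isfinite(f)) continue;
        const float via_double =
            static_cast<float>(static_cast<double>(x) / static_cast<double>(y));
        if (f != via_double) ++mismatches;
    }
    return mismatches;
}
```

Calling count_division_mismatches(1000000, 42u) and checking that it returns 0 gives the same kind of empirical evidence as the loop above; it is not a proof.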

Upvotes: 1

Eric Postpischil

Reputation: 222352

C++ does not specify which formats are used for float or double. If IEEE-754 binary32 and binary64 are used, then double-rounding errors do not occur for +, -, *, /, or sqrt. Given float x and float y, the following hold (float arithmetic on the left, double on the right):

  • x+y = (float) ((double) x + (double) y).
  • x-y = (float) ((double) x - (double) y).
  • x*y = (float) ((double) x * (double) y).
  • x/y = (float) ((double) x / (double) y).
  • sqrt(x) = (float) sqrt((double) x).

This is per the dissertation A Rigorous Framework for Fully Supporting the IEEE Standard for Floating-Point Arithmetic in High-Level Programming Languages by Samuel A. Figueroa del Cid, January 2000, New York University. Essentially, double has so many digits (bits) beyond float that the rounding to double never conceals the information needed to round correctly to float for results of these operations. (This cannot hold for operations in general; it depends on properties of these operations.) On page 57, Figueroa del Cid gives a table showing that, if the float format has p bits, then, to avoid double rounding errors, double must have 2p+1 bits for addition or subtraction, 2p for multiplication and division, and 2p+2 for sqrt. Since binary32 has 24 bits in the significand and binary64 has 53, these are satisfied. (See the paper for details. There are some caveats, such as that p must be at least 2 or 4 for the various operations.)
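Applied to the question, the identities suggest a single code path in double. The following is a minimal sketch, not the asker's actual function: Op, apply, evaluate_float, and matches_float_arithmetic are hypothetical names. It assumes IEEE-754 binary32/binary64 and no excess precision in float expressions. For the 32-bit case it parses with std::stof rather than parsing to double and narrowing, because the identities cover the arithmetic, not the decimal-to-binary conversion of the input strings.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class Op { Add, Sub, Mul, Div };

// Performs the operation in double precision only.
double apply(Op op, double a, double b) {
    switch (op) {
        case Op::Add: return a + b;
        case Op::Sub: return a - b;
        case Op::Mul: return a * b;
        case Op::Div: return a / b;
    }
    throw std::invalid_argument("unknown operation");
}

// One code path for both widths: compute in double, narrow once for 32-bit.
std::string evaluate_float(const std::string& s1, const std::string& s2,
                           Op op, unsigned bit_width) {
    assert(bit_width == 32 || bit_width == 64);
    if (bit_width == 32) {
        const float a = std::stof(s1);  // parse as float, then widen exactly
        const float b = std::stof(s2);
        const float r = static_cast<float>(apply(op, a, b));
        return std::to_string(r);
    }
    return std::to_string(apply(op, std::stod(s1), std::stod(s2)));
}

// Spot-checks the four identities for one pair of floats.
bool matches_float_arithmetic(float a, float b) {
    const float sum = a + b, diff = a - b, prod = a * b, quot = a / b;
    const double da = a, db = b;
    return sum == static_cast<float>(da + db)
        && diff == static_cast<float>(da - db)
        && prod == static_cast<float>(da * db)
        && quot == static_cast<float>(da / db);
}
```

Note that std::to_string formats with six digits after the decimal point, so it may hide differences in the low bits; a caller that needs round-trippable output would have to format the result differently.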

Upvotes: 2
