Reputation: 215
I have a function that takes two strings (each holding a floating-point number), an operation, and a floating-point bit width:
EvaluateFloat(const string &str1, const string &str2, enum operation /* add, subtract, multiply, div */, unsigned int bit_width, string &output)
The inputs str1 and str2 may be float (32-bit) or double (64-bit) values.
Is it fine to store the inputs in double and perform the operation in double regardless of the bit width, then cast the result to float if the bit width was 32? For example:
double num1 = atof(str1.c_str());
double num2 = atof(str2.c_str());
double result = num1 operation num2; //! operation will be resolved using a switch
if (32 == bit_width)
{
    float f_result = static_cast<float>(result);
    output = std::to_string(f_result);
}
else
{
    output = std::to_string(result);
}
Can I safely assume that f_result will be exactly the same as if I had performed the operation using the float type, i.e.:
float f_num1 = num1;
float f_num2 = num2;
float f_result = f_num1 operation f_num2;
Upvotes: 0
Views: 1039
Reputation: 835
According to the standards, a floating-point operation on double is equivalent to performing the operation in infinite precision and then rounding the result to double. If we then convert that result to float, we have rounded it twice. In general this is not equivalent to rounding to float in the first place. For example, 0.47 rounds to 0.5, which rounds to 1, but 0.47 rounds directly to 0. As mentioned by chtz, the product of two floats is always exactly representable as a double (using IEEE math, where double has more than twice the precision of float), so when we cast to float we have still only lost precision once, and the result should be the same. Likewise, addition and subtraction should not be a problem.
The quotient of two floats generally cannot be represented exactly in a double (not even 1/3), so we might expect a problem with division. However, I ran the sample code below overnight, trying over 3 trillion cases, and did not find any case where performing the divide in double and then casting to float gives a different answer.
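#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
int main() {
    long long i = 0;  // long may be 32 bits wide, too small for trillions of iterations
    while (true) {
        float x = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        float y = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        if (y == 0.0f) { continue; }  // skip division by zero (0/0 yields NaN, which never compares equal)
        float f = x / y;  // division performed in float
        double d = static_cast<double>(x) / static_cast<double>(y);  // division performed in double
        // Progress report every 10 million iterations.
        if (++i % 10000000 == 0) { std::cout << i << "\t" << x << "," << y << std::endl; }
        if (static_cast<float>(d) != f) {
            // Mismatch: print the operands, their bit patterns, and the difference.
            std::uint32_t xb, yb;
            std::memcpy(&xb, &x, sizeof xb);  // memcpy avoids the aliasing problem of *(int*)&x
            std::memcpy(&yb, &y, sizeof yb);
            std::cout << std::endl;
            std::cout << x << "," << y << std::endl;
            std::cout << std::hex << xb << "," << yb << std::endl;
            std::cout << static_cast<float>(d) - f << std::endl;
            return 1;
        }
    }
}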
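The claim that the product of two floats is exactly representable as a double can also be checked directly rather than by sampling results. The following is a minimal sketch, assuming IEEE-754 binary32/binary64, no excess-precision evaluation (for example x86-64 with SSE), and a correctly rounded std::fma. The helper name product_is_exact is hypothetical and is not part of the original answer.

```cpp
#include <cmath>

// Hypothetical helper: returns true when the double product of two floats
// carries no rounding error. std::fma(a, b, -p) evaluates a*b - p with a
// single rounding, so the result is zero exactly when p is the true product.
// Assumes finite inputs.
bool product_is_exact(float x, float y) {
    const double a = x;      // widening a float to double is always exact
    const double b = y;
    const double p = a * b;  // rounded to double (if any rounding is needed)
    return std::fma(a, b, -p) == 0.0;
}
```

Under those assumptions, a call such as product_is_exact(0.1f, 0.3f) should return true, because two 24-bit significands multiply to at most 48 bits, which fits in the 53-bit significand of a double.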
Upvotes: 1
Reputation: 222352
C++ does not specify which formats are used for float or double. If IEEE-754 binary32 and binary64 are used, then double-rounding errors do not occur for +, -, *, /, or sqrt. Given float x and float y, the following hold (float arithmetic on the left, double on the right):
x+y = (float) ((double) x + (double) y).
x-y = (float) ((double) x - (double) y).
x*y = (float) ((double) x * (double) y).
x/y = (float) ((double) x / (double) y).
sqrt(x) = (float) sqrt((double) x).
This is per the dissertation A Rigorous Framework for Fully Supporting the IEEE Standard for Floating-Point Arithmetic in High-Level Programming Languages by Samuel A. Figueroa del Cid, January 2000, New York University. Essentially, double has so many digits (bits) beyond float that the rounding to double never conceals the information needed to round correctly to float for results of these operations. (This cannot hold for operations in general; it depends on properties of these operations.) On page 57, Figueroa del Cid gives a table showing that, if the float format has p bits, then, to avoid double rounding errors, double must have 2p+1 bits for addition or subtraction, 2p for multiplication and division, and 2p+2 for sqrt. Since binary32 has 24 bits in the significand and double has 53, these are satisfied. (See the paper for details. There are some caveats, such as that p must be at least 2 or 4 for the various operations.)
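The width condition and the five identities above can be spot-checked in code. This is a minimal sketch, assuming IEEE-754 binary32/binary64, float expressions evaluated in float precision (no x87 excess precision, no fast-math flags), and a correctly rounded sqrt. The names kFloatBits, kDoubleBits, widths_suffice, and same_as_float are hypothetical and chosen only for illustration. It checks individual inputs and does not replace the proof in the dissertation.

```cpp
#include <cmath>
#include <limits>

// Significand widths: p for float and q for double (24 and 53 under IEEE-754).
constexpr int kFloatBits = std::numeric_limits<float>::digits;
constexpr int kDoubleBits = std::numeric_limits<double>::digits;

// The strictest bound quoted above is 2p+2 (for sqrt); it covers the others.
constexpr bool widths_suffice = kDoubleBits >= 2 * kFloatBits + 2;

// Hypothetical helper: true when float arithmetic and "compute in double,
// then narrow to float" agree for all five operations.
// Assumes x >= 0 (for sqrt), y != 0 (for division), and finite inputs.
bool same_as_float(float x, float y) {
    const double dx = x;
    const double dy = y;
    bool ok = true;
    ok = ok && (x + y == static_cast<float>(dx + dy));
    ok = ok && (x - y == static_cast<float>(dx - dy));
    ok = ok && (x * y == static_cast<float>(dx * dy));
    ok = ok && (x / y == static_cast<float>(dx / dy));
    ok = ok && (std::sqrt(x) == static_cast<float>(std::sqrt(dx)));
    return ok;
}
```

With 24 and 53 significand bits, the strictest requirement is 2p+2 = 50, so widths_suffice should be true, and same_as_float should return true for any inputs that meet the stated assumptions.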
Upvotes: 2