Reputation: 493
I implemented a "double-double" class in C++. A "double-double" uses two doubles to increase precision. A number is treated as number = hi + lo, though the sum is never actually evaluated, because in double precision hi + lo == hi. Here is a shortened version of my code:
class doubledouble
{
public:
    double hi, lo;

    doubledouble()
    {
        hi = 0.0;
        lo = 0.0;
    }

    // Fast two-sum (Dekker): assumes |a| >= |b|.
    doubledouble quickTwoSum(const double a, const double b) const
    {
        doubledouble out;
        double s = a + b;
        double e = b - (s - a);
        out.hi = s;
        out.lo = e;
        return (out);
    }

    // Two-sum (Knuth): no requirement on the magnitudes of a and b.
    doubledouble twoSum(const double a, const double b) const
    {
        double s = a + b;
        double v = s - a;
        double e = (a - (s - v)) + (b - v);
        doubledouble out;
        out.hi = s;
        out.lo = e;
        return (out);
    }

    doubledouble operator+(const doubledouble& in) const
    {
        doubledouble s, t;
        s = twoSum(in.hi, this->hi);
        t = twoSum(in.lo, this->lo);
        s.lo += t.hi;
        s = quickTwoSum(s.hi, s.lo);
        s.lo += t.lo;
        s = quickTwoSum(s.hi, s.lo);
        return (s);
    }

    // Dekker-style splitting of a double into high and low parts.
    doubledouble split(const double a) const
    {
        const double split = (1 << 12) + 1;
        double t = a * split;
        doubledouble out;
        out.hi = t - (t - a);
        out.lo = a - out.hi;
        return (out);
    }

    // Product with computed error term: hi is the rounded product, lo the error.
    doubledouble twoProd(const double a, const double b) const
    {
        double p = a * b;
        doubledouble aS = split(a);
        doubledouble bS = split(b);
        double err = ((aS.hi * bS.hi - p) + aS.hi * bS.lo + aS.lo * bS.hi) + aS.lo * bS.lo;
        aS.hi = p;
        aS.lo = err;
        return (aS);
    }

    doubledouble operator*(const doubledouble& in) const
    {
        doubledouble p;
        p = twoProd(this->hi, in.hi);
        p.lo += this->hi * in.lo + this->lo * in.hi;
        p = quickTwoSum(p.hi, p.lo);
        return (p);
    }
};
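Here is a minimal usage sketch of the class (main and the includes are added just for this example): when it works correctly, the small term survives in lo instead of being lost as it would be in a plain double sum.

#include <iostream>
#include <iomanip>

int main()
{
    doubledouble a, b;
    a.hi = 1.0;   a.lo = 0.0;
    b.hi = 1e-17; b.lo = 0.0;   // 1e-17 is lost when added to 1.0 in plain double
    doubledouble c = a + b;
    std::cout << std::setprecision(17)
              << "hi: " << c.hi << ", lo: " << c.lo << std::endl;
    // Expected: hi is 1 and lo is roughly 1e-17; a plain double sum gives just 1.
    return 0;
}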
My problem is that this code works in debug mode, but in release mode I only get normal double precision. I am using the MSVC compiler, and if I put a #pragma optimize("", off) before the class and a corresponding #pragma optimize("", on) after it, the code works (very slowly, of course).
I am pretty sure the problem lies in expressions like t - (t - a), which mathematically is simply equal to a, so the compiler probably optimizes it that way. To narrow down the problem, I tried putting #pragma optimize around certain functions. But even if I put every single function inside the class between pragmas, the code does not work correctly.
Either way, using #pragma optimize is not very useful: the code is slow and it is not portable. My first question is: why does #pragma optimize not work for specific functions inside the class? That would be really useful for narrowing down the problem.
Second question: how could I trick/convince the compiler not to optimize certain expressions like t - (t - a)? Using a temp variable would probably not work, because the compiler would optimize that away.
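To illustrate the suspected transformation, here is a sketch of what a value-unsafe optimizer could legally make of the split (hypothetical function names, not actual compiler output):

// what split() writes:
double naive_split_hi(double a, double split)
{
    double t = a * split;
    return t - (t - a);   // mathematically t - t + a == a
}

// what a value-unsafe optimizer may turn it into:
double folded_split_hi(double a, double /*split*/)
{
    return a;             // a - hi then folds to 0.0: the low part vanishes
}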
Edit
As a reproducible example I used this code:
double a = 1.23456789012345;
const double split = (1 << 12) + 1;
double t = a * split;
double hi = t - (t - a);
double lo = a - hi;
std::cout << "hi: " << hi << ", lo: " << lo << std::endl;
Output in both cases (release and debug mode):
hi: 1.23457, lo: -3.48832e-13
So t - (t - a) does not seem to be the problem.
As for a "simple" example that produces the wrong behavior (if I find a simpler one, I will edit): I calculated the Mandelbrot set with zooms greater than 10^14. In debug mode I got perfect results; in release mode I get pixelation. I used this function to create the images:
void CalcMandelCPU_doubledouble(std::atomic<unsigned int>* RowCounter)
{
    std::complex<doubledouble> Z, C;
    int Y = (*RowCounter)++;
    while (Y < ResY)
    {
        for (int X = 0; X < ResX; X++)
        {
            C.real(minX + (X + 0.5) * (maxX - minX) / ResX);
            C.imag(minY + (Y + 0.5) * (maxY - minY) / ResY);
            int i;
            Z = 0;
            for (i = 0; i < MaxIter; i++)
            {
                Z = Z * Z + C;
                if (Z.real() * Z.real() + Z.imag() * Z.imag() > 4.0)
                    break;
            }
            if (i == MaxIter)
            {
                Result[Y * ResX + X] = 0;
            }
            else
            {
                Result[Y * ResX + X] = i + 1 - log(log(Z.real() * Z.real() + Z.imag() * Z.imag())) / log(2.0);
            }
        }
        Y = (*RowCounter)++;
    }
}
Define MaxIter, ResX and ResY. minX, maxX, minY and maxY will give you the (zoomed) view. Example values for the view variables (as doubledoubles):
minX.hi = -1.9997740601362948, minX.lo = 9.2915246622385581e-17
maxX.hi = -1.9997740601362861, maxX.lo = 7.5150963188138267e-17
minY.hi = -3.2900430992572130e-09, minY.lo = 1.7490722630808694e-25
maxY.hi = -3.2900375437016574e-09, maxY.lo = 1.0427104313278710e-25
Upvotes: 1
Views: 415
Reputation: 222273
Write only single operations, using a cast or assignment to isolate each operation.
The C++ standard allows an implementation to use excess precision in expression evaluation, but requires casts and assignments to use the nominal precision. So double x = t - (t - a); may be implemented as if it had infinite precision, in which case it is mathematically equal to a, and optimization may reduce it to a. However, double x = t - static_cast<double>(t - a); must produce a double value for the cast, and double t0 = t - a, x = t - t0; likewise must produce a double value for t0.
(Technically, the implementation could use excess precision for the subtraction, including rounding, and then round it to double, allowing the possibility of a double-rounding error, but I have not heard of this happening.)
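Applied to the question's split member function, the approach looks like this (a sketch, keeping the question's splitter constant; it relies on the standard-conforming behavior described above):

doubledouble split(const double a) const
{
    const double split = (1 << 12) + 1;   // the question's constant, kept as-is
    double t = a * split;
    double d = t - a;                     // the assignment forces d to be rounded to double
    doubledouble out;
    out.hi = t - d;                       // no longer written as t - (t - a)
    out.lo = a - out.hi;
    return (out);
}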
Note this assumes standard-conforming behavior. If the compiler has been licensed to perform additional optimizations, as with a switch telling it to be looser with math than the standard specifies, the methods in this answer may be insufficient.
Mainstream compilers are generally fine when compiling for x86-64, which implies using SSE2 for floating point, so the hardware format is IEEE 754 64-bit double. The same goes for most other modern ISAs like PowerPC and ARM / AArch64. So math should be IEEE-compliant as long as you avoid options like MSVC /fp:fast or GCC/Clang -ffast-math.
GCC's default (with the default -std=gnu++20 or whatever) is -fexcess-precision=fast, so you'll want -fexcess-precision=standard, or -std=c++20 which implies that, if you're compiling for 32-bit x86 with x87 floating point. (SSE2 doesn't have this problem.) MSVC targeting x86 with x87 FP sets the FPU precision-control bits so all results are rounded to a 53-bit mantissa (so it's like double except for the exponent range), avoiding the need for a store/reload to actually round to double. (Or if you link a Direct3D library, it'll set the precision control to 24 bits, float mantissa width, according to Bruce Dawson's excellent articles from 2012 about FP math in general and MSVC/Windows in particular.)
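For example, builds along these lines should give strict IEEE double behavior (file names are placeholders; -ffp-contract is discussed next):

# GCC targeting x86-64: SSE2 double, no fast-math, no contraction across statements
g++ -O2 -fexcess-precision=standard -ffp-contract=off doubledouble.cpp -o dd_test

# MSVC: the precise model (the default) rather than /fp:fast
cl /O2 /fp:precise doubledouble.cpp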
GCC's default is -ffp-contract=fast, which will allow it to contract a * split and the subsequent t - a into fma(a, split, -a), even across statements, when FMA is available (like -march=x86-64-v3 when compiling for x86-64).
MSVC's default is #pragma fp_contract off; Clang's default is -ffp-contract=on (within one expression, not across statements, like ISO C #pragma STDC FP_CONTRACT on, which is also supported by some C++ compilers). (GCC doesn't support on, only off and fast, so GCC's on is equivalent to off: the only way to stop its optimizer contracting across statements is to disable contraction entirely.)
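In source form, the pragmas mentioned above look like this (a sketch; each pragma is compiler-specific as noted in the comments, and GCC honors none of them):

#pragma fp_contract (off)         // MSVC syntax: disable contraction from here on
#pragma clang fp contract(off)    // Clang-specific equivalent
#pragma STDC FP_CONTRACT OFF      // ISO C pragma; ignored by GCC (GCC only has -ffp-contract)

double mulSubNoFMA(double a, double b, double p)
{
    return a * b - p;             // with contraction off, this may not become fma(a, b, -p)
}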
Classic ICC's default is -fp-model fast=1, so for this you'd need -fp-model precise. See also slides from a 2012 talk about FP math optimization in ICC and GCC. The LLVM-based ICX doesn't have the same options, being more like Clang.
Other compilers default to the equivalent of MSVC /fp:precise or gcc/clang -fno-fast-math.
Upvotes: 3
Reputation: 493
OK, here is the solution:
Switching from fp:fast to fp:strict, as mentioned by multiple users in the comments, solves the problem. However, I did not want to change global optimization settings so as not to affect the rest of my code. The solution was to put volatile before certain variables (if there is a better solution, please comment). Indeed, only two changes were necessary.
Function twoSum:
doubledouble twoSum(const double a, const double b) const
{
    volatile double s = a + b;
    volatile double v = s - a;
    double e = (a - (s - v)) + (b - v);
    doubledouble out;
    out.hi = s;
    out.lo = e;
    return (out);
}
... and function split:
doubledouble split(const double a) const
{
    const double split = (1 << 26) + 1;
    volatile double t = a * split;
    doubledouble out;
    out.hi = t - (t - a);
    out.lo = a - out.hi;
    return (out);
}
Thank you to the commenters.
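An MSVC-specific alternative that avoids volatile might be to scope the floating-point model to just these helpers with the documented float_control pragma, leaving the rest of the build on fp:fast. A sketch (the helper name is illustrative, and the pragma must appear at file or namespace scope):

#pragma float_control(precise, on, push)   // precise FP semantics for the functions below

// free-function variant of twoSum, kept precise even under /fp:fast
doubledouble twoSum_precise(double a, double b)
{
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    doubledouble out;
    out.hi = s;
    out.lo = e;
    return (out);
}

#pragma float_control(pop)                 // restore the surrounding model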
Upvotes: -1