Martin Thoma

Reputation: 136655

Can I calculate error introduced by doubles?

Suppose I have an irrational number like \sqrt{3}. As it is irrational, it has no finite decimal (or binary) representation. So when you try to express it with an IEEE 754 double, you will introduce an error.

A decimal representation with a lot of digits is:

1.732050807568877293527446341505872366942805253810380628055806979451933016908800037081146186757248575675...

Now, when I calculate \sqrt{3}, I get 1.732051:

#include <stdio.h> // printf
#include <math.h>   // needed for sqrt

int main() {
    double myVar = sqrt (3);
    printf("as double:\t%f\n", myVar);
}

According to Wolfram|Alpha, I have an error of 1.11100... × 10^-7.

Is there any way I can calculate the error myself?

(I don't mind switching to C++, Python or Java. I could probably also use Mathematica, if there is no simple alternative)

Just to clarify: I don't want a solution that works only for \sqrt{3}. I would like to get a function that gives me the error for any number. If that is not possible, I would at least like to know how Wolfram|Alpha gets more digits.

My try

While writing this question, I found this:

#include <stdio.h> // printf
#include <math.h>  // needed for sqrt
#include <float.h> // needed for higher precision

int main() {
    long double r = sqrtl(3.0L);
    printf("Precision: %d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
}

With this one, I can get the error down to 2.0 * 10^-18 according to Wolfram|Alpha. So I thought this might be close enough to get a good estimation of the error. I wrote this:

#include <stdio.h> // printf
#include <math.h>  // needed for sqrt
#include <float.h>

int main() {
    double myVar = sqrt (3);
    long double r = sqrtl(3.0L);
    long double error = abs(r-myVar) / r;
    printf("Double:\t\t%f\n", myVar);
    printf("Precision:\t%d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
    printf("Error:\t\t%.*Lg\n", LDBL_DIG, error);
}

But it outputs:

Double:     1.732051
Precision:  18 digits; 1.73205080756887729
Error:      0

How can I fix that to get the error?

Upvotes: 2

Views: 973

Answers (6)

Pascal Cuoq

Reputation: 80325

One way to obtain an interval that is guaranteed to contain the real value of the computation is to use interval arithmetic. Then, comparing the double result to the interval tells you how far the double computation is, at worst, from the real computation.

Frama-C's value analysis can do this for you with option -all-rounding-modes.

double Frama_C_sqrt(double x);

double sqrt(double x)
{
  return Frama_C_sqrt(x);
}

double y;

int main(){
  y = sqrt(3.0);
}

Analyzing the program with:

frama-c -val t.c -float-normal -all-rounding-modes
[value] Values at end of function main:
      y ∈ [1.7320508075688772 .. 1.7320508075688774]

This means that the real value of sqrt(3), and thus the value that would be in variable y if the program computed with real numbers, is within the double bounds [1.7320508075688772 .. 1.7320508075688774].

Frama-C's value analysis does not support the long double type, but if I understand correctly, you were only using long double as reference to estimate the error made with double. The drawback of that method is that long double is itself imprecise. With interval arithmetic as implemented in Frama-C's value analysis, the real value of the computation is guaranteed to be within the displayed bounds.

Upvotes: 1

Eric Postpischil

Reputation: 223737

You want fabsl instead of abs when calculating the error, at least when using C. (In C, abs takes and returns an int, so the tiny difference r-myVar is converted to 0.) With this substitution, I get:

Double:     1.732051
Precision:  18 digits; 1.73205080756887729
Error:      5.79643049346087304e-17

(Calculated on Mac OS X 10.8.3 with Apple clang 4.0.)

Using long double to estimate the errors in double is a reasonable approach for a few simple calculations, except:

  • If you are calculating the more accurate long double results, why bother with double?
  • Error behavior in sequences of calculations is hard to describe and can grow to the point where long double is not providing an accurate estimate of the exact result.
  • There exist perverse situations where long double gets less accurate results than double. (Mostly encountered when somebody constructs an example to teach students a lesson, but they exist nonetheless.)

In general, there is no simple and efficient way to calculate the error in a floating-point result in a sequence of calculations. If there were, it would be effectively a means of calculating a more accurate result, and we would use that instead of the floating-point calculations alone.

In special cases, such as when developing math library routines, the errors resulting from a particular sequence of code are studied carefully (and the code is redesigned as necessary to have acceptable error behavior). More often, error is estimated either by performing various “experiments” to see how much results fluctuate with varying inputs or by studying general mathematical behavior of systems.

You also asked “I would like to get a function that gives me the error for any number.” Well, that is easy: given any number x and the calculated result x', the error is exactly x' - x. The actual problem is that you probably do not have a description of x that can be used to evaluate that expression easily. In your example, x is sqrt(3). Obviously, then, the error is x' - sqrt(3), and x' is exactly 1.732050807568877193176604123436845839023590087890625. Now all you need to do is evaluate sqrt(3). In other words, numerically evaluating the error is about as hard as numerically evaluating the original number.

Is there some class of numbers you want to perform this analysis for?

Also, do you actually want to calculate the error or just a good bound on the error? The latter is somewhat easier, although it remains hard for sequences of calculations. For all elementary operations, IEEE 754 requires the produced result to be the result that is nearest the mathematically exact result (in the appropriate direction for the rounding mode being used). In round-to-nearest mode, this implies that each result is at most 1/2 ULP (unit of least precision) away from the exact result. For operations such as those found in the standard math library (sine, logarithm, et cetera), most libraries will produce results within a few ULP of the exact result.

Upvotes: 0

wirrbel

Reputation: 3299

What Every Computer Scientist Should Know About Floating-Point Arithmetic by Goldberg is the definitive guide you are looking for.

https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf

Upvotes: 3

teppic

Reputation: 8195

printf rounds doubles to 6 places when you use %f without a precision.

e.g.

double x = 1.3;
long double y = 1.3L;
long double err = y - (double) x;
printf("Error %.20Lf\n", err);

My output: -0.00000000000000004445

If the result is 0, your long double and double are the same.
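To see more of the double than the 6 places %f shows by default, you can ask for 17 significant digits, which is enough to tell any two different IEEE 754 doubles apart. A small sketch (the helper name format_double_full is made up):

```c
#include <stdio.h>  // snprintf, size_t

// Writes x into buf with 17 significant digits.
// Returns what snprintf returns: the number of characters needed,
// not counting the terminating null byte.
int format_double_full(char *buf, size_t size, double x) {
    return snprintf(buf, size, "%.17g", x);
}
```

For the double nearest to sqrt(3), this produces 1.7320508075688772 instead of 1.732051.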

Upvotes: 1

Art

Reputation: 20402

According to the C standard printf("%f", d) will default to 6 digits after the decimal point. This is not the full precision of your double.

It might be that double and long double happen to be the same on your architecture. I have different sizes for them on my architecture and get a non-zero error in your example code.
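One way to check this on a given platform is to compare the significand widths from float.h. A minimal sketch (the function name long_double_extra_bits is made up):

```c
#include <float.h>  // DBL_MANT_DIG, LDBL_MANT_DIG

// Number of extra significand bits long double has over double.
// A result of 0 means long double cannot serve as a more precise
// reference for double results on this platform.
int long_double_extra_bits(void) {
    return LDBL_MANT_DIG - DBL_MANT_DIG;
}
```

With the x87 80-bit extended format this is 64 - 53 = 11; where long double is just a 64-bit double it is 0.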

Upvotes: 0

user1944441

Reputation:

The mistake is in how you print the double. The line printf("Double:\t\t%f\n", myVar); shows only Double: 1.732051.

The actual value of double myVar is

1.732050807568877281 //18 digits

so 1.732050807568877281-1.732050807568877281 is zero

Upvotes: 0
