SU3
SU3

Reputation: 5387

Numerical accuracy of pow(a/b,x) vs pow(b/a,-x)

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)? If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?

Edit: Let's assume x86_64 processor and gcc compiler.

Edit: I tried comparing using some random numbers. For example:

printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037

So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.

Upvotes: 7

Views: 387

Answers (4)

Eric Postpischil
Eric Postpischil

Reputation: 222362

In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.

In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.

Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b•(1+e0))x•(1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a•(1+e2))x•(1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.

Each of the errors, e0e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.

Rewriting, these are (a/b)x•(1+e0)x•(1+e1) and (a/b)x•(1+e2)x•(1+e3). This reveals the primary difference is in (1+e0)x versus (1+e2)x. The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]

Consider (1+e0)x and (1+e2)x.The potential values of the first expression span [(1−u/2)x, (1+u/2)x], while the second spans [(1+u/2)x, (1−u/2)x]. When x > 0, the second interval is longer than the first:

  • The length of the first is (1+u/2)x−(1+u/2)x.
  • The length of the second is (1/(1−u/2))x−(1/(1+u/2))x.
  • Multiplying the latter by (1−u2/22)x produces ((1−u2/22)/(1−u/2))x−( (1−u2/22)/(1+u/2))x = (1+u/2)x−(1+u/2)x, which is the length of the first interval.
  • 1−u2/22 < 1, so (1−u2/22)x < 1 for positive x.
  • Since the first length equals the second length times a number less than one, the first interval is shorter.

Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.

Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.

Upvotes: 2

Daniel Langr
Daniel Langr

Reputation: 23497

For evaluation of rounding errors like in your case, it might be useful to use some multi-precision library, such as Boost.Multiprecision. Then, you can compare results for various precisions, e.g, such as with the following program:

#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>

namespace mp = boost::multiprecision;

template <typename FLOAT>
void comp() {
  FLOAT a = 8.72138221;
  FLOAT b = 1.761329479;
  FLOAT c = 1.51231;

  FLOAT e = mp::pow(a / b, -c);
  FLOAT f = mp::pow(b / a, c);

  std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
  std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}

int main() {
  std::cout << "Double: " << std::endl;
  comp<mp::cpp_bin_float_double>();
  td::cout << std::endl;

  std::cout << "Double extended: " << std::endl;
  comp<mp::cpp_bin_float_double_extended>();
  std::cout << std::endl;

  std::cout << "Quad: " << std::endl;
  comp<mp::cpp_bin_float_quad>();
  std::cout << std::endl;

  std::cout << "Dec-100: " << std::endl;
  comp<mp::cpp_dec_float_100>();
  std::cout << std::endl;
}

Its output reads, on my platform:

Double: 
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618

Double extended: 
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241

Quad: 
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527

Dec-100: 
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004

Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6

For double, the first calculation was more accurate, however, I guess one cannot make any generic conclusions here.


Also, note that your input numbers are not accurately representable with the IEEE 754 double precision floating-point type (none of them). The question is whether you care about the accuracy of calculations with either those exact numbers of their closest representations.

Upvotes: 0

chux
chux

Reputation: 153348

... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?

Whichever division is more arcuate.


Consider z = xy = 2y * log2(x).

Roughly: The error in y * log2(x) is magnified by the value of z to form the error in z. xy is very sensitive to the error in x. The larger the |log2(x)|, the greater concern.

In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and same z and similar errors in z. It is a question of how x, y are formed:


a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.

Yet with select values of a/b vs. b/a, one quotient will be more exact and it is that approach with the lower pow() error.

pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).

Lacking assurance about the error in the division, the general rule applies: no major difference.

Upvotes: 2

geza
geza

Reputation: 29942

Here's one way to answer such questions, to see how floating-point behaves. This is not a 100% correct way to analyze such question, but it gives a general idea.

Let's generate random numbers. Calculate v0=pow(a/b, n) and v1=pow(b/a, -n) in float precision. And calculate ref=pow(a/b, n) in double precision, and round it to float. We use ref as a reference value (we suppose that double has much more precision than float, so we can trust that ref can be considered the best possible value. This is true for IEEE-754 for most of the time). Then sum the difference between v0-ref and v1-ref. The difference should calculated with "the number of floating point numbers between v and ref".

Note, that the results may be depend on the range of a, b and n (and on the random generator quality. If it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that the algorithm of float/double division/pow is the same kind, have the same characteristics.

For my computer, the summed differences are: 2604828 2603684, it means that there is no significant precision difference between the two.

Here's the code (note, this code supposes IEEE-754 arithmetic):

#include <cmath>
#include <stdio.h>
#include <string.h>

long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff<0) diff = -diff;
    return diff;
}

int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i=0; i<10000000; i++) {
        float a = 1.0f*rand()/RAND_MAX;
        float b = 1.0f*rand()/RAND_MAX;
        float n = 4.0f*rand()/RAND_MAX - 2.0f;

        if (a==0||b==0) continue;

        float v0 = std::pow(a/b, n);
        float v1 = std::pow(b/a, -n);
        float ref = std::pow((double)a/b, n);

        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }

    printf("%lld %lld\n", e0, e1);
}

Upvotes: 2

Related Questions