Rookie
Rookie

Reputation: 4144

Should I combine multiplication and division steps when working with floating point values?

I am aware of the precision problems in floats and doubles, which why I am asking this:

If I have a formula such as: (a/PI)*180.0 (where PI is a constant)

Should I combine the division and multiplication, so I can use only one division: a/0.017453292519943295769236, in order to avoid loss of precision ?

Does this make it more precise when it has less steps to calculate the result?

Upvotes: 3

Views: 305

Answers (1)

Pascal Cuoq
Pascal Cuoq

Reputation: 80355

Short answer

Yes, you should in general combine as many multiplications and divisions by constants as possible into one operation. It is (in general(*)) faster and more accurate at the same time.

Neither π nor π/180 nor their inverses are representable exactly as floating-point. For this reason, the computation will involve at least one approximate constant (in addition to the approximation of each of the operations involved).

Because two operations introduce one approximation each, it can be expected to be more accurate to do the whole computation in one operation.

In the case at hand, is division or multiplication better?

Apart from that, it is a question of “luck” whether the relative accuracy to which π/180 can be represented in the floating-point format is better or worse than that of 180/π.

My compiler provides addition precision with the long double type, so I am able to use it as reference for answering this question for double:

~ $ cat t.c
#define PIL 3.141592653589793238462643383279502884197L

#include <stdio.h>

int main() {

  long double heop = 180.L / PIL;
  long double pohe = PIL / 180.L;
  printf("relative acc. of π/180: %Le\n", (pohe - (double) pohe) / pohe);
  printf("relative acc. of 180/π: %Le\n", (heop - (double) heop) / heop);
}
~ $ gcc t.c && ./a.out 
relative acc. of π/180: 1.688893e-17
relative acc. of 180/π: -3.469703e-17

In usual programming practice, one wouldn't bother and simply multiply by (the floating-point representation of) 180/π, because multiplication is so much faster than division. As it turns out, in the case of the binary64 floating-point type double almost always maps to, π/180 can be represented with better relative accuracy than 180/π, so π/180 is the constant one should use to optimize accuracy: a / ((double) (π / 180)). With this formula, the total relative error would be approximately the sum of the relative error of the constant (1.688893e-17) and of the relative error of the division (which will depend on the value of a but never be more than 2-53).

Alternative methods for faster and more accurate results

Note that division is so expensive that you could get an even more accurate result faster by using one multiplication and one fma: let heop1 be the best double approximation of 180/π, and heop2 the best double approximation of 180/π - heop1. Then the best value for the result can be computed as:

double r = fma(a, heop1, a * heop2);

The fact that the above is the absolute best possible double approximation to the real computation is a theorem (in fact, it is a theorem with exceptions. The details can be found in the “Handbook of Floating-Point Arithmetic”). But even when the real constant you want to multiply a double by in order to get a double result is one of the exceptions to the theorem, the above computation is still clearly very accurate and only differs from the best double approximation for a few exceptional values of a.


If, like mine, your compiler provides more precision for long double than for double, you can also use one long double multiplication:

// this is more accurate than double division:
double r = (double)((long double) a * 57.295779513082320876798L)

This is not as good as the solution based on fma, but it is good enough that for most values of a, it produces the optimal double approximation to the real computation.

A counter-example to the general claim that operations should be grouped as one

(*) The claim that it is better to group constant is only statistically true for most constants.

If you happened to wish to multiply a by, say, the real constant 0.0000001 * DBL_MIN, you would be better off multiplying first by 0.0000001, then by DBL_MIN, and the end result (which can be a normalized number if a is larger than 1000000 or so) would be more precise than if you had multiplied by the best double representation of 0.0000001 * DBL_MIN. This is because the relative accuracy when representing 0.0000001 * DBL_MIN as a single double value is much worse than the accuracy for representing 0.0000001.

Upvotes: 4

Related Questions