Reputation: 11
Please help me with this problem:
For x = -6.5799015957503127E+02 and y = -4.6102597302044005E+03
// in C:
atan2(x,y) = -2.99982704867761151845684253203217e+00
// in Fortran:
atan2(x,y) = -2.99982704867761151845684253203217D+00
// But atan2 called in CUDA kernel is:
atan2(x,y) = -2.99982704867761107436763268196955E+00
// ^^^^^^^^^^^^^^^^^^^^^
Yes, it could be due to round-off errors, but why is the result the same in Fortran and C, while the CUDA result differs slightly?
I need atan2 in CUDA to return the same number as in Fortran and C. How can I do that?
Upvotes: 0
Views: 2961
Reputation: 151799
Note that your values of x and y (expressed in decimal) are not exactly representable as binary (floating-point) numbers within the range of a double. The following program demonstrates this:
$ cat t556.cu
#include <stdio.h>
#include <math.h>
__global__ void mykernel(double x, double y, double *res){
  // atan2 from the CUDA device math library
  *res = atan2(x,y);
}

int main(){
  double h_x = -6.5799015957503127E+02;
  double h_y = -4.6102597302044005E+03;
  double *d_res;
  double my_res = atan2(h_x, h_y);       // host (C library) result
  cudaMalloc(&d_res, sizeof(double));
  mykernel<<<1,1>>>(h_x, h_y, d_res);    // device result
  double h_res;
  cudaMemcpy(&h_res, d_res, sizeof(double), cudaMemcpyDeviceToHost);
  printf("x = %.32lf\n", h_x);
  printf("y = %.32lf\n\n", h_y);
  printf("hst = %.32lf\n", my_res);
  printf("dev = %.32lf\n\n", h_res);
  // raw bit patterns of the two results (%llx matches unsigned long long)
  printf("hst bin = 0x%llx\n", *(reinterpret_cast<unsigned long long *>(&my_res)));
  printf("dev bin = 0x%llx\n", *(reinterpret_cast<unsigned long long *>(&h_res)));
  return 0;
}
$ nvcc -arch=sm_20 -o t556 t556.cu
$ ./t556
x = -657.99015957503127083327854052186012
y = -4610.25973020440051186596974730491638
hst = -2.99982704867761151845684253203217
dev = -2.99982704867761107436763268196955
hst bin = 0xc007ffa552ddcff5
dev bin = 0xc007ffa552ddcff4
$
We see that when we specify x according to what you have written in your question, and then print it out with a large number of digits, the printout does not match the value "presumably" assigned by the code.
The above program also demonstrates that the results of atan2 computed on the host and on the device differ by 1 bit, in the least significant place of the mantissa.
Referring to the CUDA math documentation in the programming guide, we see that the maximum error of the atan2 function is 2 ULP (ULP = units in the last place; for this discussion, one ULP is the value represented by the least significant bit of the mantissa of the result). This means that the atan2 function in CUDA device code is not guaranteed to produce a correctly rounded result (i.e. one computed as if in arbitrary precision throughout and then rounded once), and the result it produces may differ by 1 or 2 ULP from such a correctly rounded IEEE-754 implementation. So when comparing the CUDA implementation of atan2 to another implementation, it is reasonable to expect a difference of 1 or 2 mantissa LSBs in the results.
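For concreteness, here is one way to measure that difference directly. This is an illustrative helper of my own (ulp_distance is not a library function); it assumes IEEE-754 binary64 doubles of the same sign, and uses memcpy for the type pun rather than the reinterpret_cast shortcut in the program above:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* ULP distance between two finite doubles of the same sign: for
   same-sign IEEE-754 doubles, the ordering of the raw bit patterns
   matches the ordering of the values, so subtracting the patterns
   counts the representable doubles lying between a and b. */
static uint64_t ulp_distance(double a, double b) {
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);  /* well-defined bit copy */
    memcpy(&ub, &b, sizeof ub);
    return ua > ub ? ua - ub : ub - ua;
}

int main(void) {
    /* the host and device results printed by the program above */
    double hst = -2.99982704867761151845684253203217;
    double dev = -2.99982704867761107436763268196955;
    printf("ulp distance = %llu\n",
           (unsigned long long)ulp_distance(hst, dev));  /* prints 1 */
    return 0;
}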
If you require the atan2 result computed by the CUDA device to match the atan2 result of another implementation perfectly (with no mantissa bits differing), then the atan2 function provided by the CUDA math library will not be usable for you.
The only advice I can give you at that point would be to create your own implementation of atan2, using more basic floating-point operations (presumably choosing ones that offer 0 ULP error in the CUDA math library, although I am not an expert on how this would be done in detail) and numerical methods designed to match the implementation you are comparing against.
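As a rough illustration of what such an implementation involves, the quadrant logic of atan2 can be layered over a one-argument arctangent kernel. The sketch below is my own (my_atan2 is a hypothetical name, not the CUDA library's method): it delegates to atan and omits the special cases (signed zeros, infinities, NaN) that a real implementation must handle, so its accuracy, and hence any bit-matching, is entirely determined by the arctangent kernel you would substitute for the library call.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// Structural sketch of atan2(y, x): quadrant reduction around a
// one-argument arctangent. To pursue bit-exact matching, replace the
// atan() calls with a kernel whose rounding matches your reference
// implementation. Special values (signed zeros, infinities, NaN)
// are deliberately omitted.
__host__ __device__ double my_atan2(double y, double x)
{
    if (x > 0.0)                        // right half-plane
        return atan(y / x);
    if (x < 0.0)                        // left half-plane: shift by +/- pi
        return (y >= 0.0) ? atan(y / x) + M_PI
                          : atan(y / x) - M_PI;
    if (y > 0.0) return  M_PI / 2.0;    // positive y-axis
    if (y < 0.0) return -M_PI / 2.0;    // negative y-axis
    return 0.0;                         // origin: conventionally 0 here
}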
This may be an informative read as well. Note that the GNU implementations do not guarantee 0 ULP error across all math functions, or even across all trig-type functions; for example, cos appears to have a maximum 1 ULP error on IA64. However, atan2 appears to be at the lowest error level listed.
Upvotes: 3
Reputation: 80255
I use GPU arch sm_20, and all functions in my project (sin, cos, sinh, cosh, etc.) give the same results as in C. So why does atan2 not? :(
Most implementations of trigonometric functions nowadays are precise to a trifle above 0.5 ULP, meaning that in 99% of cases, the exact mathematical result has only one representable floating-point approximation that they can return (which is the closest approximation to the real result).
However, you shouldn't assume that any trigonometric function is perfect, that is, accurate to 0.5 ULP, unless you have chosen your math library for this property. This means that, for some rare arguments where the mathematical result is almost exactly halfway between two representable doubles, a trigonometric function can return the wrong one (say, the one at a distance of 0.507 ULP instead of the one at a distance of 0.493 ULP).
This also means that two different implementations can return different results (one implementation can correctly return the result at 0.493 ULP while another returns the wrong result at 0.507 ULP).
This can happen for all trigonometric functions. You just happened to run into this problem with atan2, but the same thing could have happened to you with sin or cos. It may be that in one of the libraries you use, atan2 is implemented less accurately (say, to 0.52 ULP instead of 0.505 ULP), making the issue more likely to be noticed. But unless you use correctly rounded libraries on both sides, or the same library on both sides (which will not be correctly rounded but will produce the same errors on both sides), this will happen from time to time.
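To make "almost exactly halfway" concrete: the two results in the question are in fact neighboring doubles, with no representable value between them, which is precisely the situation where two libraries rounding in opposite directions can disagree. A small host-side check of this, using the values printed by the program in the other answer and nextafter from <math.h>:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* host and device results from the other answer's program */
    double hst = -2.99982704867761151845684253203217;
    double dev = -2.99982704867761107436763268196955;
    /* nextafter(hst, 0.0) is the representable double one step from
       hst toward zero; for these values it is exactly dev, so the two
       results are adjacent doubles, 1 ULP apart. */
    printf("adjacent: %d\n", nextafter(hst, 0.0) == dev);  /* prints 1 */
    return 0;
}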
One example of a correctly rounded math library, which produces the same results as any other correctly rounded math library, is CRlibm. One example of a math library that isn't too bad and is frequently embedded into programs so that they produce the same results everywhere is netlib.
Upvotes: 2