Reputation: 15
While debugging some CUDA code, I was comparing it to an equivalent CPU implementation using printf
statements, and noticed that in some cases my results differed. They weren't necessarily wrong on either platform, since they agreed to within floating-point rounding error, but I am still interested in knowing what gives rise to this difference.
I was able to track the problem down to differing dot product results. In both the CUDA and host code I have vectors a and b of type float4. Then, on each platform, I compute the dot product and print the result, using this code:
printf("a: %.24f\t%.24f\t%.24f\t%.24f\n",a.x,a.y,a.z,a.w);
printf("b: %.24f\t%.24f\t%.24f\t%.24f\n",b.x,b.y,b.z,b.w);
float dot_product = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
printf("a dot b: %.24f\n",dot_product);
and the resulting printout for the CPU is:
a: 0.999629139900207519531250 -0.024383276700973510742188 -0.012127066962420940399170 0.013238593004643917083740
b: -0.001840781536884605884552 0.033134069293737411499023 0.988499701023101806640625 1.000000000000000000000000
a dot b: -0.001397025771439075469971
and for the CUDA kernel:
a: 0.999629139900207519531250 -0.024383276700973510742188 -0.012127066962420940399170 0.013238593004643917083740
b: -0.001840781536884605884552 0.033134069293737411499023 0.988499701023101806640625 1.000000000000000000000000
a dot b: -0.001397024840116500854492
As you can see, the values of a and b appear to be bitwise identical on both platforms, yet the result of the exact same expression differs ever so slightly. It is my understanding that floating-point multiplication is well-defined per the IEEE 754 standard and is hardware-independent. However, I do have two hypotheses as to why I am not seeing the same results:
Upvotes: 1
Views: 1919
Reputation: 26085
Except for merging FMUL and FADD into FMA (which can be disabled with the nvcc command-line switch -fmad=false), the CUDA compiler observes the evaluation order prescribed by C/C++. Depending on how your CPU code is compiled, it may accumulate the dot product in a wider precision than single precision, which then yields a different result.
For GPU code, merging of FMUL/FADD into FMA is a common occurrence, as are the resulting numerical differences. The CUDA compiler performs aggressive FMA merging for performance reasons. Use of FMA usually also yields more accurate results, since the number of rounding steps is reduced, and there is some protection against subtractive cancellation because FMA maintains the full-width product internally. I would suggest reading the following whitepaper, as well as the references it cites:
https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
To get the CPU and GPU results to match for a sanity check, you would want to turn off FMA merging in the GPU code with -fmad=false, and on the CPU enforce that each intermediate result is stored in single precision:
/* volatile forces every intermediate value to be stored, and therefore
   rounded, as a single-precision float */
volatile float p0,p1,p2,p3,dot_product;
p0=a.x*b.x;
p1=a.y*b.y;
p2=a.z*b.z;
p3=a.w*b.w;
dot_product=p0;
dot_product+=p1;
dot_product+=p2;
dot_product+=p3;
Upvotes: 4