Reputation: 19434
Why is it that when using CUDA, if I perform an FFT of size 1 million, I get slightly different results each time I run it?
My hardware has the Fermi architecture.
Upvotes: 0
Views: 1208
Reputation: 28292
This might have an easy answer. CUDA programs frequently use the float type, since it can be considerably faster than double. The order in which operations are evaluated can significantly affect the final value of a floating-point computation; this isn't unique to CUDA, but you may notice the effects particularly acutely because it is such a massively parallel paradigm (and with parallelism comes nondeterminism, at least in operations such as global reductions).
EDIT: Just to be clear, a necessary (though not sufficient) condition for this behavior is that CUDA does not guarantee that the same kernel performs its operations in the same order across several executions. If CUDA did guarantee this, the order in which the arithmetic operations are executed could not vary from run to run, and one would not expect to see different values for the same floating-point computation.
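To see concretely why a parallel reduction is sensitive to ordering, here is a small sketch in plain C (not CUDA, and entirely deterministic on its own; the array contents and the power-of-two size are arbitrary choices of mine for illustration). It sums the same data twice, once left to right and once in the pairwise, tree-like order a typical GPU reduction uses:

#include <stdio.h>

#define N 1024   /* power of two, to keep the tree reduction simple */

/* Sum left to right: ((x[0] + x[1]) + x[2]) + ... */
float sequential_sum(const float *x, int n)
{
    float s = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sum in the pairwise, tree-like order a parallel reduction typically uses:
   x[i] += x[i + stride] for the lower half of the array, then repeat on the
   half-sized array, and so on. Modifies x in place and returns x[0]. */
float tree_sum(float *x, int n)
{
    int stride, i;
    for (stride = n / 2; stride >= 1; stride /= 2)
        for (i = 0; i < stride; i++)
            x[i] += x[i + stride];
    return x[0];
}

int main(void)
{
    float a[N], b[N];
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[i] = 1.0f / (float)(i + 1);   /* arbitrary test data */

    printf("sequential sum: %.7f\n", sequential_sum(a, N));
    printf("tree sum:       %.7f\n", tree_sum(b, N));
    return 0;
}

The two totals will typically differ in the last digit or two, even though both are legitimate sums of the same values. A real GPU reduction may additionally combine per-block partial results in an order that is not fixed from run to run (for example, when atomics are involved), which is where run-to-run variation can enter.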
As an even more minimal illustration, here is a simple C program demonstrating that evaluation order alone can change a floating-point result. Try the code
#include <stdio.h>

int main()
{
    float a = 100.0f, b = 0.00001f, c = 0.00001f;
    printf("a + b + c = %f\n", a + b + c);
    printf("b + c + a = %f\n", b + c + a);
    printf("a + b + c == b + c + a ? %d\n", (a + b + c) == (b + c + a));
    return 0;
}
on Linux and see what you get (I'm using 64-bit RHEL 6 and gcc version 4.4.4-13). My output is the following:
[user@host directory]# gcc add.c -o add
[user@host directory]# ./add
a + b + c = 100.000015
b + c + a = 100.000023
a + b + c == b + c + a ? 0
EDIT: Note that while this program might suggest that the underlying issue is that floating-point addition is not commutative, it is actually that floating-point addition is not associative (since C evaluates the additions from left to right, the first expression is computed as (a + b) + c and the second as (b + c) + a). The reason for the non-associativity is that a floating-point representation can hold only finitely many significant digits (in base 2, though the discussion for a base-10 system is essentially the same). For instance, if only three significant digits can be represented, then (100 + 0.5) + 0.5 = 100 + 0.5 = 100, whereas 100 + (0.5 + 0.5) = 100 + 1 = 101. In the first case, the intermediate result 100 + 0.5 must be truncated (or possibly rounded up), since the value 100.5 cannot be represented with only three significant digits.
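To make that distinction concrete, here is a variation of the program above (same values, and the same assumption that intermediates are evaluated in single precision) that commutes the operands of one addition and re-associates the other; only the re-association changes the result:

#include <stdio.h>

int main(void)
{
    float a = 100.0f, b = 0.00001f, c = 0.00001f;

    /* Commuting the operands of a single addition changes nothing... */
    printf("a + (b + c) == (b + c) + a ? %d\n", (a + (b + c)) == ((b + c) + a));

    /* ...but re-associating the additions does (these match the two values
       printed by the program above). */
    printf("(a + b) + c = %f\n", (a + b) + c);
    printf("a + (b + c) = %f\n", a + (b + c));
    printf("(a + b) + c == a + (b + c) ? %d\n", ((a + b) + c) == (a + (b + c)));
    return 0;
}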
There are a number of important implications of this phenomenon; for instance, you will generally get a more accurate answer by adding numbers in increasing order of magnitude (smallest exponents first). The real take-away, though, is that you shouldn't expect the results to be identical unless the computations are performed in the same order, and that may be hard to guarantee when using CUDA on a real GPU.
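Here is a minimal sketch of that first point, with constants chosen by me purely to make the effect obvious: adding 10,000 ones to 1.0e8 one at a time discards every single one of them, while summing the ones first preserves them all.

#include <stdio.h>

#define N 10000

int main(void)
{
    float big = 100000000.0f;               /* 1.0e8, exactly representable */
    float ascending = 0.0f, descending = big;
    int i;

    /* Ascending order: accumulate the small values first, then add the
       large value at the end. The result is exact: 100010000.0 */
    for (i = 0; i < N; i++)
        ascending += 1.0f;
    ascending += big;

    /* Descending order: start with the large value; each added 1.0f is
       smaller than half an ulp of the running sum, so it is rounded away
       and the total never moves off 100000000.0 */
    for (i = 0; i < N; i++)
        descending += 1.0f;

    printf("small values first: %.1f\n", ascending);
    printf("large value first:  %.1f\n", descending);
    return 0;
}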
Upvotes: 4