Reputation: 1021
I am trying to use the GPU to accelerate a program that computes the L2 distance between two float arrays. To check the accuracy of the computation, I wrote both a CUDA program and a CPU program. However, I found that the total error is more than 200, which I do not understand. I use the float type in both cases, so I believe I should get the same result. My code is as follows.
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>
#define VECTORDIM 3
double cpuSecond()
{
struct timeval tp;
gettimeofday(&tp, NULL);
return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}
void DistanceCPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
float temp;
for (int i = 0; i < narray1; i++)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
for (int l = 0; l < VECTORDIM; l++)
{
temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2);
}
output[i * narray2 + j] = temp;
}
}
}
__global__ void DistGPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
float temp;
if (i < narray1)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
temp += powf(array1[i] - array2[j], 2);
temp += powf(array1[i + narray1] - array2[j + narray2], 2);
temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
output[i * narray2 + j] = temp;
}
}
}
int main()
{
int narray1 = 7000;
int narray2 = 60000;
float* array1 = new float[narray1 * VECTORDIM];
float* array2 = new float[narray2 * VECTORDIM];
float* outputGPU = new float[narray1 * narray2];
float* outputCPU = new float[narray1 * narray2];
float* outputCPUTest = new float[narray1 * narray2];
float* d_array1;
float* d_array2;
float* d_output;
for (int i = 0; i < narray1 * VECTORDIM; i++)
{
array1[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
// std::cout << "Element " << i << " " << array1[i] << std::endl;
}
for (int i = 0; i < narray2 * VECTORDIM; i++)
{
array2[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
}
cudaError_t err;
err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(float));
err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(float));
err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(float));
err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);
err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);
int threadsPerBlock = 512;
int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
double iStart = cpuSecond();
DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
double iElaps = cpuSecond() - iStart;
err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(float), cudaMemcpyDeviceToHost);
printf("Total computation time is %lf \n" , iElaps);
DistanceCPU(array1, array2, narray1, narray2, outputCPU);
float error = 0;
for (long i = 0; i < narray1 * narray2; i++)
{
error += abs(outputCPU[i] - outputGPU[i]);
}
error /= (narray2 * narray1);
for (int i = 0; i < 20; i++)
{
printf("CPU result %f \n", outputCPU[i]);
printf("GPU result %f \n", outputGPU[i]);
}
printf("Error is %f \n", error);
delete [] array1;
delete [] array2;
delete [] outputCPU;
delete [] outputGPU;
return 0;
}
I printed some of the computation results from both the CPU and the GPU, and I get the following output.
CPU result 84.315201
GPU result 84.315193
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009750
I believe that float precision should be sufficient, and I do not know what the real problem is.
Upvotes: 3
Views: 1172
Reputation: 151799
I would say the principal contributor here is the use of the powf function. A particular math function implementation on the GPU is not guaranteed to give the same accuracy as the same math function in CPU code. Whether that is a sufficient or even applicable explanation here, I cannot say, because we would probably also have to discuss which CPU compiler you are using, as well as the compile switches/settings. The error bounds for the GPU math functions are documented in the CUDA programming guide.
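If you want to see this kind of per-function discrepancy in isolation, here is a small, self-contained sketch (not part of your code; the file name and kernel name are made up for illustration) that evaluates powf(x, 2.0f) for the same inputs on the host and in a kernel, and counts how often the two results are not bit-identical:

// powf_check.cu -- hypothetical standalone test, compile with: nvcc -o powf_check powf_check.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void SquareWithPowf(const float* in, float* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = powf(in[i], 2.0f);      // device-side powf
}

int main()
{
    const int n = 1 << 20;
    float* h_in  = new float[n];
    float* h_out = new float[n];
    for (int i = 0; i < n; i++)
        h_in[i] = static_cast<float>(rand()) / RAND_MAX * 10.0f;   // same value range as in the question

    float* d_in;
    float* d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    SquareWithPowf<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // count inputs where the host powf and the device powf disagree bit-for-bit
    int diffs = 0;
    for (int i = 0; i < n; i++)
        if (powf(h_in[i], 2.0f) != h_out[i])
            diffs++;
    printf("host/device powf results differ for %d of %d inputs\n", diffs, n);

    cudaFree(d_in);
    cudaFree(d_out);
    delete [] h_in;
    delete [] h_out;
    return 0;
}

Small per-call differences of this kind are consistent with the documented error bounds and are enough to produce the last-digit mismatches visible in your print-out.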
However, in my opinion there is really not much point in using pow or powf to square things, if what you're interested in is performance. I assume that since you're asking a question about GPUs, you are interested in performance.
If we replace the use of the powf function with an ordinary squaring operation, the GPU results become much closer to the CPU results, from what I can see.
Results from running the code as is, on CUDA 10.0, Tesla P100, CentOS 7, gcc 4.8.5:
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000038
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970345
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315193
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009750
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000001
Modified code, replacing powf with ordinary squaring:
$ cat t415.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>
#define VECTORDIM 3
typedef float mt;
double cpuSecond()
{
struct timeval tp;
gettimeofday(&tp, NULL);
return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}
void DistanceCPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
mt temp;
for (int i = 0; i < narray1; i++)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
for (int l = 0; l < VECTORDIM; l++)
{
#ifndef USE_POW
temp += (array1[i + l * narray1] - array2[j + l * narray2])*(array1[i + l * narray1] - array2[j + l * narray2]);
#else
temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2);
#endif
}
output[i * narray2 + j] = temp;
}
}
}
__global__ void DistGPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
mt temp;
if (i < narray1)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
#ifndef USE_POW
temp += (array1[i] - array2[j])*(array1[i] - array2[j]);
temp += (array1[i + narray1] - array2[j + narray2])*(array1[i + narray1] - array2[j + narray2]);
temp += (array1[i + 2 * narray1] - array2[j + 2 * narray2])*(array1[i + 2 * narray1] - array2[j + 2 * narray2]);
#else
temp += powf(array1[i] - array2[j], 2);
temp += powf(array1[i + narray1] - array2[j + narray2], 2);
temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
#endif
output[i * narray2 + j] = temp;
}
}
}
int main()
{
int narray1 = 7000;
int narray2 = 60000;
mt* array1 = new mt[narray1 * VECTORDIM];
mt* array2 = new mt[narray2 * VECTORDIM];
mt* outputGPU = new mt[narray1 * narray2];
mt* outputCPU = new mt[narray1 * narray2];
mt* outputCPUTest = new mt[narray1 * narray2];
mt* d_array1;
mt* d_array2;
mt* d_output;
for (int i = 0; i < narray1 * VECTORDIM; i++)
{
array1[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
// std::cout << "Element " << i << " " << array1[i] << std::endl;
}
for (int i = 0; i < narray2 * VECTORDIM; i++)
{
array2[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
}
cudaError_t err;
err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(mt));
err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(mt));
err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(mt));
err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);
err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);
int threadsPerBlock = 512;
int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
double iStart = cpuSecond();
DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
double iElaps = cpuSecond() - iStart;
err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(mt), cudaMemcpyDeviceToHost);
printf("Total computation time is %lf \n" , iElaps);
DistanceCPU(array1, array2, narray1, narray2, outputCPU);
mt error = 0;
for (long i = 0; i < narray1 * narray2; i++)
{
error += abs(outputCPU[i] - outputGPU[i]);
}
error /= (narray2 * narray1);
for (int i = 0; i < 20; i++)
{
printf("CPU result %f \n", outputCPU[i]);
printf("GPU result %f \n", outputGPU[i]);
}
printf("Error is %f \n", error);
delete [] array1;
delete [] array2;
delete [] outputCPU;
delete [] outputGPU;
return 0;
}
$ nvcc -o t415 t415.cu
t415.cu(87): warning: variable "err" was set but never used
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000042
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145149
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092331
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000
Some notes:
- This modification does not remove every last difference; a few of the printed values still differ in the last digit. My guess is that the remaining differences are due to FMA contraction on the GPU side (see the EDIT below).
- If you really want to study the accuracy here, I would suggest comparing both float and double computations, to establish a baseline of what number is closer to the correct result. There are situations where the GPU produces a number that is closer to the correct result than the corresponding CPU code, so simply making the assumption that the CPU code is correct and then asking for an explanation of why the GPU code is different is not always the correct approach. Here is an example of this kind of mistake.
- As discussed above, powf( ,2) is not a good choice for squaring things if you care about performance; an ordinary multiplication is preferable.
- Your timing measurement only captures the kernel launch overhead, because kernel launches are asynchronous. To time the kernel execution itself, put a cudaDeviceSynchronize(); call in the timing region, immediately after the kernel call (a short sketch of this is given at the end of this answer, after the -fmad=false output).

EDIT: Courtesy of @njuffa, who reminded me that it's easy to check the FMA contraction hypothesis: if we recompile the previously modified code with -fmad=false, then we observe (at least as far as the print-out goes) identical results between GPU and CPU. So FMA contraction (on the GPU side) is likely the final contributor to the few differences remaining in the previous section. As mentioned in the comment by njuffa, FMA contraction is likely to produce higher-accuracy results, and so a possible explanation here is that the GPU results (with FMA contraction, as displayed previously) are probably more accurate than the CPU results. Again, switching to double precision would help to confirm this, and the code is already set up to make that easy with a typedef change. In any event, here is the output of the previous code (float, with ordinary squaring) compiled with -fmad=false:
$ nvcc -o t415 t415.cu -fmad=false
t415.cu(87): warning: variable "err" was set but never used
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000039
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000
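For completeness, here is the minimal sketch of the timing fix mentioned in the notes above (based on the code in this answer, not a separate change): because the kernel launch returns immediately, a cudaDeviceSynchronize() inside the timed region is what makes the measurement cover the kernel execution rather than just the launch overhead.

double iStart = cpuSecond();
DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
cudaDeviceSynchronize();   // wait for the kernel to finish before stopping the timer
double iElaps = cpuSecond() - iStart;
printf("Total computation time is %lf \n", iElaps);

The "Total computation time" values printed earlier should be interpreted with that in mind.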
Upvotes: 6