Reputation: 1021
I am trying to use the GPU to accelerate a program that computes the L2 distance between two float arrays. To check the accuracy of the computation, I wrote both a CUDA program and a CPU program. However, I found that the total error is more than 200, which I do not understand. I use the float type in both cases, so I believe I should get the same result. My code is as follows.
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>
#define VECTORDIM 3
double cpuSecond()
{
struct timeval tp;
gettimeofday(&tp, NULL);
return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}
void DistanceCPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
float temp;
for (int i = 0; i < narray1; i++)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
for (int l = 0; l < VECTORDIM; l++)
{
temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2);
}
output[i * narray2 + j] = temp;
}
}
}
__global__ void DistGPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
float temp;
if (i < narray1)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
temp += powf(array1[i] - array2[j], 2);
temp += powf(array1[i + narray1] - array2[j + narray2], 2);
temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
output[i * narray2 + j] = temp;
}
}
}
int main()
{
int narray1 = 7000;
int narray2 = 60000;
float* array1 = new float[narray1 * VECTORDIM];
float* array2 = new float[narray2 * VECTORDIM];
float* outputGPU = new float[narray1 * narray2];
float* outputCPU = new float[narray1 * narray2];
float* outputCPUTest = new float[narray1 * narray2];
float* d_array1;
float* d_array2;
float* d_output;
for (int i = 0; i < narray1 * VECTORDIM; i++)
{
array1[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
// std::cout << "Element " << i << " " << array1[i] << std::endl;
}
for (int i = 0; i < narray2 * VECTORDIM; i++)
{
array2[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
}
cudaError_t err;
err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(float));
err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(float));
err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(float));
err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);
err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);
int threadsPerBlock = 512;
int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
double iStart = cpuSecond();
DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
double iElaps = cpuSecond() - iStart;
err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(float), cudaMemcpyDeviceToHost);
printf("Total computation time is %lf \n" , iElaps);
DistanceCPU(array1, array2, narray1, narray2, outputCPU);
float error = 0;
for (long i = 0; i < narray1 * narray2; i++)
{
error += abs(outputCPU[i] - outputGPU[i]);
}
error /= (narray2 * narray1);
for (int i = 0; i < 20; i++)
{
printf("CPU result %f \n", outputCPU[i]);
printf("GPU result %f \n", outputGPU[i]);
}
printf("Error is %f \n", error);
delete [] array1;
delete [] array2;
delete [] outputCPU;
delete [] outputGPU;
return 0;
}
I printed some of the computation results from both the CPU and the GPU, and I get the following output.
CPU result 84.315201
GPU result 84.315193
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009750
I believe that float precision should be sufficient, and I do not know what the real problem is.
Upvotes: 3
Views: 1172
Reputation: 151799
I would say the principal contributor here is the use of the powf function. A particular math function implementation on the GPU is not guaranteed to give the same accuracy as the same math function in CPU code. Whether that is a sufficient or even applicable explanation here, I cannot say, because we would probably also have to discuss which CPU compiler you are using, as well as the compile switches/settings. The error bounds for the GPU math functions are documented in the CUDA programming guide.
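If you want to see this kind of per-function discrepancy in isolation, here is a small, self-contained sketch (not part of your code; the file name and kernel name are made up for illustration) that evaluates powf(x, 2.0f) for the same inputs on the host and in a kernel, and counts how often the two results are not bit-identical:

// powf_check.cu -- hypothetical standalone test, compile with: nvcc -o powf_check powf_check.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void SquareWithPowf(const float* in, float* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = powf(in[i], 2.0f);      // device-side powf
}

int main()
{
    const int n = 1 << 20;
    float* h_in  = new float[n];
    float* h_out = new float[n];
    for (int i = 0; i < n; i++)
        h_in[i] = static_cast<float>(rand()) / RAND_MAX * 10.0f;   // same value range as in the question

    float* d_in;
    float* d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    SquareWithPowf<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // count inputs where the host powf and the device powf disagree bit-for-bit
    int diffs = 0;
    for (int i = 0; i < n; i++)
        if (powf(h_in[i], 2.0f) != h_out[i])
            diffs++;
    printf("host/device powf results differ for %d of %d inputs\n", diffs, n);

    cudaFree(d_in);
    cudaFree(d_out);
    delete [] h_in;
    delete [] h_out;
    return 0;
}

Small per-call differences of this kind are consistent with the documented error bounds and are enough to produce the last-digit mismatches visible in your print-out.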
However, in my opinion there is really not much point in using pow or powf to square things, if what you're interested in is performance. I assume that since you're asking a question about GPUs, you are interested in performance.
If we replace the use of the powf function with an ordinary squaring operation, the GPU results become much closer to the CPU results, from what I can see.
Results from running the code as is, on CUDA 10.0, Tesla P100, CentOS 7, gcc 4.8.5:
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000038
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970345
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315193
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009750
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000001
Modified code, replacing powf with ordinary squaring:
$ cat t415.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>
#define VECTORDIM 3
typedef float mt;
double cpuSecond()
{
struct timeval tp;
gettimeofday(&tp, NULL);
return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}
void DistanceCPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
mt temp;
for (int i = 0; i < narray1; i++)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
for (int l = 0; l < VECTORDIM; l++)
{
#ifndef USE_POW
temp += (array1[i + l * narray1] - array2[j + l * narray2])*(array1[i + l * narray1] - array2[j + l * narray2]);
#else
temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2);
#endif
}
output[i * narray2 + j] = temp;
}
}
}
__global__ void DistGPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
mt temp;
if (i < narray1)
{
for (int j = 0; j < narray2; j++)
{
temp = 0;
#ifndef USE_POW
temp += (array1[i] - array2[j])*(array1[i] - array2[j]);
temp += (array1[i + narray1] - array2[j + narray2])*(array1[i + narray1] - array2[j + narray2]);
temp += (array1[i + 2 * narray1] - array2[j + 2 * narray2])*(array1[i + 2 * narray1] - array2[j + 2 * narray2]);
#else
temp += powf(array1[i] - array2[j], 2);
temp += powf(array1[i + narray1] - array2[j + narray2], 2);
temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
#endif
output[i * narray2 + j] = temp;
}
}
}
int main()
{
int narray1 = 7000;
int narray2 = 60000;
mt* array1 = new mt[narray1 * VECTORDIM];
mt* array2 = new mt[narray2 * VECTORDIM];
mt* outputGPU = new mt[narray1 * narray2];
mt* outputCPU = new mt[narray1 * narray2];
mt* outputCPUTest = new mt[narray1 * narray2];
mt* d_array1;
mt* d_array2;
mt* d_output;
for (int i = 0; i < narray1 * VECTORDIM; i++)
{
array1[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
// std::cout << "Element " << i << " " << array1[i] << std::endl;
}
for (int i = 0; i < narray2 * VECTORDIM; i++)
{
array2[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
}
cudaError_t err;
err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(mt));
err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(mt));
err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(mt));
err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);
err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);
int threadsPerBlock = 512;
int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
double iStart = cpuSecond();
DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
double iElaps = cpuSecond() - iStart;
err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(mt), cudaMemcpyDeviceToHost);
printf("Total computation time is %lf \n" , iElaps);
DistanceCPU(array1, array2, narray1, narray2, outputCPU);
mt error = 0;
for (long i = 0; i < narray1 * narray2; i++)
{
error += abs(outputCPU[i] - outputGPU[i]);
}
error /= (narray2 * narray1);
for (int i = 0; i < 20; i++)
{
printf("CPU result %f \n", outputCPU[i]);
printf("GPU result %f \n", outputGPU[i]);
}
printf("Error is %f \n", error);
delete [] array1;
delete [] array2;
delete [] outputCPU;
delete [] outputGPU;
return 0;
}
$ nvcc -o t415 t415.cu
t415.cu(87): warning: variable "err" was set but never used
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000042
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145149
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092331
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000
Some notes:
- This modification does not remove every last difference; a few of the printed values still differ in the last digit. My guess is that the remaining differences are due to FMA contraction on the GPU side (see the EDIT below).
- If you really want to study the accuracy here, I would suggest comparing both float and double computations, to establish a baseline of what number is closer to the correct result. There are situations where the GPU produces a number that is closer to the correct result than the corresponding CPU code, so simply making the assumption that the CPU code is correct and then asking for an explanation of why the GPU code is different is not always the correct approach. Here is an example of this kind of mistake.
- As discussed above, powf( ,2) is not a good choice for squaring things if you care about performance; an ordinary multiplication is preferable.
- Your timing measurement only captures the kernel launch overhead, because kernel launches are asynchronous. To time the kernel execution itself, put a cudaDeviceSynchronize(); call in the timing region, immediately after the kernel call (a short sketch of this is given at the end of this answer, after the -fmad=false output).

EDIT: Courtesy of @njuffa, who reminded me that it's easy to check the FMA contraction hypothesis: if we recompile the previously modified code with -fmad=false, then we observe (at least as far as the print-out goes) identical results between GPU and CPU. So FMA contraction (on the GPU side) is likely the final contributor to the few differences remaining in the previous section. As mentioned in the comment by njuffa, FMA contraction is likely to produce higher-accuracy results, and so a possible explanation here is that the GPU results (with FMA contraction, as displayed previously) are probably more accurate than the CPU results. Again, switching to double precision would help to confirm this, and the code is already set up to make that easy with a typedef change. In any event, here is the output of the previous code (float, with ordinary squaring) compiled with -fmad=false:
$ nvcc -o t415 t415.cu -fmad=false
t415.cu(87): warning: variable "err" was set but never used
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000039
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000
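For completeness, here is the minimal sketch of the timing fix mentioned in the notes above (based on the code in this answer, not a separate change): because the kernel launch returns immediately, a cudaDeviceSynchronize() inside the timed region is what makes the measurement cover the kernel execution rather than just the launch overhead.

double iStart = cpuSecond();
DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
cudaDeviceSynchronize();   // wait for the kernel to finish before stopping the timer
double iElaps = cpuSecond() - iStart;
printf("Total computation time is %lf \n", iElaps);

The "Total computation time" values printed earlier should be interpreted with that in mind.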
Upvotes: 6