Reputation: 113
I have the following, easily reproducible problem, when running the following kernel, which does nothing except atomicAdds of floats:
#define OUT_ITERATIONS 20000000
#define BLOCKS 12
#define THREADS 192
__global__ void testKernel(float* result) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
float bias = 1.0f;
int n = 1;
while (i < OUT_ITERATIONS) {
atomicAdd(result, bias);
i += BLOCKS * THREADS;
}
}
The kernel is supposed to increment the result OUT_ITERATIONS times, that is 20M. I call the kernel with this standard code:
int main() {
cudaError_t cudaStatus;
float* result;
float* dev_result;
// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
goto Error;
}
result = new float;
cudaStatus = cudaMalloc((void**)&dev_result, sizeof(float));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
cudaStatus = cudaMemset(dev_result, 0, sizeof(float));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
// Launch a kernel on the GPU with one thread for each element.
testKernel<<<BLOCKS, THREADS>>>(dev_result);
// Check for any errors launching the kernel
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
goto Error;
}
cudaStatus = cudaMemcpy(result, dev_result, sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
printf("Result: %f\n", *result);
However, the result printed at the end is 16777216.0, which is incidentally 0x1000000 in hex. The problem does not occur if OUT_ITERATIONS < 16777216, that is, if I change it to 16777000 for example, sure enough the output is 16777000.0!
System: NVidia-Titan, CUDA 5.5, Windows7
Upvotes: 1
Views: 1224
Reputation: 9771
This issue is due to the limited precision of the type float
.
float
has only 24bit binary precison. If you add 2 numbers where one is more than 2^24-1
times larger than the other, the result will be exactly the same as the larger one.
When you add a big number like 16777216.0(=2^24) with a tiny number like 1.0, you will lost some precison and the result will still be 16777216.0. The same situations happens in a standard C propgram
float a=16777216.0f;
float b=1.0f;
printf("%f\n",a+b);
You could replace float
with double
or int
to solve this problem.
Please refer to cuda doc for the implementation of the double
version of atomicAdd()
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
Upvotes: 7
Reputation: 151799
20M does not fit within the available integer precision in a float
.
A float
quantity does not have 32 bits of mantissa (you discovered how many mantissa bits there are with your observation of "incidentally 0x1000000 in hex"), so it cannot represent all integers in the same way that a int
or unsigned int
can.
16777216 is the largest integer that can be reliably stored in a float
.
Limit your storage range to what will fit in float
, or else use some other representation, such as unsigned int
or double
if you want to reliably store 20M as an integer.
This isn't really a CUDA issue. You'd have similar difficulty trying to store large integers in a float
in host code.
Upvotes: 4