user2412789
user2412789

Reputation: 113

CUDA - atomicAdd only adds up to 16777216

I have the following, easily reproducible problem, when running the following kernel, which does nothing except atomicAdds of floats:

#define OUT_ITERATIONS 20000000
#define BLOCKS 12
#define THREADS 192

__global__ void testKernel(float* result) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float bias = 1.0f;
    int n = 1;

    while (i < OUT_ITERATIONS) {
        atomicAdd(result, bias);
        i += BLOCKS * THREADS;
    }
}

The kernel is supposed to increment the result OUT_ITERATIONS times, that is 20M. I call the kernel with this standard code:

int main() {
cudaError_t cudaStatus;
float* result;
float* dev_result;

// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
}

result = new float;
cudaStatus = cudaMalloc((void**)&dev_result, sizeof(float));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}
cudaStatus = cudaMemset(dev_result, 0, sizeof(float));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

// Launch a kernel on the GPU with one thread for each element.
testKernel<<<BLOCKS, THREADS>>>(dev_result);

// Check for any errors launching the kernel
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    goto Error;
}

cudaStatus = cudaMemcpy(result, dev_result, sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

printf("Result: %f\n", *result);

However, the result printed at the end is 16777216.0, which is incidentally 0x1000000 in hex. The problem does not occur if OUT_ITERATIONS < 16777216, that is, if I change it to 16777000 for example, sure enough the output is 16777000.0!

System: NVidia-Titan, CUDA 5.5, Windows7

Upvotes: 1

Views: 1224

Answers (2)

kangshiyin
kangshiyin

Reputation: 9771

This issue is due to the limited precision of the type float.

float has only 24bit binary precison. If you add 2 numbers where one is more than 2^24-1 times larger than the other, the result will be exactly the same as the larger one.

When you add a big number like 16777216.0(=2^24) with a tiny number like 1.0, you will lost some precison and the result will still be 16777216.0. The same situations happens in a standard C propgram

float a=16777216.0f;
float b=1.0f;
printf("%f\n",a+b);

You could replace float with double or int to solve this problem.

Please refer to cuda doc for the implementation of the double version of atomicAdd()

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

Upvotes: 7

Robert Crovella
Robert Crovella

Reputation: 151799

20M does not fit within the available integer precision in a float.

A float quantity does not have 32 bits of mantissa (you discovered how many mantissa bits there are with your observation of "incidentally 0x1000000 in hex"), so it cannot represent all integers in the same way that a int or unsigned int can.

16777216 is the largest integer that can be reliably stored in a float.

Limit your storage range to what will fit in float, or else use some other representation, such as unsigned int or double if you want to reliably store 20M as an integer.

This isn't really a CUDA issue. You'd have similar difficulty trying to store large integers in a float in host code.

Upvotes: 4

Related Questions