Reputation: 9825
I am trying to use the GPU to sum an array with the following code:
__global__ void sum_array(int* a, uint n) {
    uint idx = threadIdx.x + blockIdx.x * blockDim.x;
    for (int s = 1; s < n; s *= 2) {
        uint i1 = s * 2 * idx;
        uint i2 = s * (2 * idx + 1);
        if (i2 < n) {
            a[i1] += a[i2];
        }
        __syncthreads();
    }
}
For the test I generated my array as [0, 1, 2 ... 99], so the result should be 4950. When I set the block to [1024, 1, 1] and the grid to [1, 1], everything works fine: a[0] contains the correct result after the calculation. But if I set block=[4, 1, 1] and grid=[25, 1], I get 4754, which is wrong (though from time to time the function produces the correct result). It looks like the threads in different blocks are not synced properly. How can I fix my code so that it works correctly with multiple blocks? I am going to sum long arrays that are longer than the number of threads I can use in a single block, so I need a solution that works with many blocks (gridDim.x > 1).
Upvotes: 0
Views: 679
Reputation: 9825
I found this solution:
__global__ void sum_array(int* a, uint n) {
    uint tid = threadIdx.x;
    uint offset = 2 * blockIdx.x * blockDim.x;
    for (uint s = 1; s <= blockDim.x; s *= 2) {
        if (tid % s == 0) {
            uint idx = 2 * tid + offset;
            if (idx + s < n) {
                atomicAdd(a + idx, a[idx + s]);
            }
        }
        __syncthreads();
    }
    if ((offset != 0) && (tid == 0)) {
        atomicAdd(a, a[offset]);
    }
}
In short, I applied a similar algorithm to the one in the question, but within each block separately rather than over the whole array, so at the end I needed to add the results from all blocks into a[0]. I also replaced my plain sum operator with atomicAdd to make the final accumulation between blocks correct.
Upvotes: 1