Reputation: 9825
I am trying to use the GPU to sum an array with the following code:
__global__ void sum_array(int* a, uint n) {
    uint idx = threadIdx.x + blockIdx.x * blockDim.x;
    for (int s = 1; s < n; s *= 2) {
        uint i1 = s * 2 * idx;
        uint i2 = s * (2 * idx + 1);
        if (i2 < n) {
            a[i1] += a[i2];
        }
        __syncthreads();
    }
}
For the test I generated my array as [0, 1, 2 ... 99], so the result should be 4950. When I set the block to [1024, 1, 1] and the grid to [1, 1], everything works fine: a[0] contains the correct result after the calculation. But if I set block=[4, 1, 1] and grid=[25, 1], I get 4754, which is wrong (though from time to time the function produces the correct result). It looks like the threads in different blocks are not synced properly. How can I fix my code so that it works correctly with multiple blocks? I am going to sum long arrays that are longer than the number of threads I can use in a single block, so I need a solution that works with many blocks (gridDim.x > 1).
Upvotes: 0
Views: 679
Reputation: 9825
I found this solution:
__global__ void sum_array(int* a, uint n) {
    uint tid = threadIdx.x;
    uint offset = 2 * blockIdx.x * blockDim.x;
    for (uint s = 1; s <= blockDim.x; s *= 2) {
        if (tid % s == 0) {
            uint idx = 2 * tid + offset;
            if (idx + s < n) {
                atomicAdd(a + idx, a[idx + s]);
            }
        }
        __syncthreads();
    }
    if ((offset != 0) && (tid == 0)) {
        atomicAdd(a, a[offset]);
    }
}
In short, I applied a similar algorithm to the one in the question, but within each block separately rather than over the whole array, so at the end I needed to add the results from all blocks into a[0]. I also replaced my plain sum operator with atomicAdd to make the final accumulation between blocks correct.
Upvotes: 1