Reputation: 33
Is it possible to use the opencl data parallel kernel to sum vector of size N, without doing the partial sum trick?
Say you have access to 16 work items and your vector is of size 16. Wouldn't it be possible to just have a kernel doing the following?
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    sum[0] += input[idx];
}
When I've tried this, the sum variable doesn't get updated, but only overwritten. I've read something about using barriers, and I tried inserting a barrier before the summation above; it does update the variable somehow, but it doesn't reproduce the correct sum.
Upvotes: 0
Views: 521
Reputation: 11910
Option 3 (and the worst of all):
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    // N is the vector length (e.g. passed at build time with -DN=16)
    for(int j = 0; j < N; j++)
    {
        barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
        if(idx == j)
            sum[0] += input[idx];
        else
            doOtherWorkWhileSingleCoreSums(); // placeholder for unrelated work
    }
}
Using a mainstream GPU, this should sum all of them about as slowly as a Pentium MMX. It is just like computing on a single core while giving the other cores other jobs, but in a slower way.
A CPU device could be better than a GPU for this kind of task.
Upvotes: 0
Reputation: 845
Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
Now the instruction sum[0] += input[idx] (I think) expands to a read of sum[0], then an add of input[idx], before the result is written back to sum[0].
There will be a data race as multiple threads are reading from and writing to the same shared memory location. So what might happen is:

- All of the threads read the original value of sum[0] before any other thread writes their updated result back to sum[0], in which case the final result of sum[0] would be the value of input[idx] of the thread which executed the slowest. Since this will be different each time, if you run the example multiple times you should see different results.
- Some of the threads write their updated result back to sum[0] before a slower thread reads sum[0], in which case there will be an addition using the values of more than one thread, but not all threads.

So how can you avoid this?
Option 1 - Atomics (Worse Option):
You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).
Option 2 - Reduction (Better Option):
The best solution would be to reduce the array, since you can use the parallelism most effectively, and can get O(log(N)) performance. Here is a good overview of reduction using OpenCL: Reduction Example.
Upvotes: 3