Batko

Reputation: 33

OpenCL data parallel summation into a variable

Is it possible to use an OpenCL data parallel kernel to sum a vector of size N, without doing the partial-sum trick?

Say that you have access to 16 work items and your vector is of size 16. Wouldn't it be possible to just have a kernel do the following?

    __kernel void summation(__global float* input, __global float* sum)
    {
        int idx = get_global_id(0);

        sum[0] += input[idx];
    }

When I've tried this, the sum variable doesn't get updated, but only overwritten. I've read something about using barriers, and I tried inserting a barrier before the summation above; it does update the variable somehow, but it doesn't produce the correct sum.

Upvotes: 0

Views: 521

Answers (2)

huseyin tugrul buyukisik

Reputation: 11910

Option 3 (and worst of all)

    __kernel void summation(__global float* input, __global float* sum)
    {
        int idx = get_global_id(0);
        for (int j = 0; j < N; j++)  // N = total number of elements
        {
            barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
            if (idx == j)
                sum[0] += input[idx];  // only one work item adds per pass
            else
                doOtherWorkWhileSingleCoreSums();  // placeholder for unrelated work
        }
    }

Using a mainstream GPU, this should sum all of them about as slowly as a Pentium MMX. It is just like computing on a single core and giving the other cores other jobs, but in a slower way.

A CPU device could be better than a GPU for this kind of work.

Upvotes: 0

RobClucas

Reputation: 845

Let me try to explain why sum[0] is overwritten rather than updated.

In your case of 16 work items, there are 16 threads running simultaneously. Now sum[0] is a single memory location shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads simultaneously.

Now the instruction sum[0] += input[idx] expands to a read of sum[0], then an addition of input[idx], then a write of the result back to sum[0].

There will be a data race, as multiple threads are reading from and writing to the same shared memory location. So what might happen is:

  • All threads may read the value of sum[0] before any other thread writes their updated result back to sum[0], in which case the final result of sum[0] would be the value of input[idx] of the thread which executed the slowest. Since this will be different each time, if you run the example multiple times you should see different results.
  • Or, one thread may execute slightly more slowly, in which case another thread may have already written an updated result back to sum[0] before this slow thread reads sum[0], in which case there will be an addition using the values of more than one thread, but not all threads.

So how can you avoid this?

Option 1 - Atomics (Worse Option):

You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).
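For completeness, here is what the atomic version might look like as a kernel. This is an untested sketch: core OpenCL 1.x atomic_add only supports integer types, so a float add is commonly emulated with an atomic_cmpxchg loop (the atomic_add_float helper name is my own, not a built-in):

```c
// Untested OpenCL sketch. atomic_cmpxchg is a built-in OpenCL 1.1
// function; atomic_add_float is a hypothetical helper built on top of
// it, because core OpenCL atomics only support integer types.
inline void atomic_add_float(volatile __global float *addr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                            old_val.u, new_val.u) != old_val.u);
}

__kernel void summation(__global float *input, __global float *sum)
{
    // Each work item adds its element; the compare-and-swap loop
    // retries conflicting updates, so no addition is lost -- but the
    // updates to sum[0] are effectively serialised.
    atomic_add_float((volatile __global float *)sum,
                     input[get_global_id(0)]);
}
```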

Option 2 - Reduction (Better Option):

The best solution would be to reduce the array, since this uses the parallelism most effectively and gives O(log(N)) performance. Here is a good overview of reduction using OpenCL: Reduction Example.

Upvotes: 3
