OpenCL: multiple work items saving results to the same global memory address

I'm trying to do a reduce-like cumulative calculation where 4 different values need to be stored, depending on certain conditions. My kernel receives long arrays as input and needs to store only 4 values, which are "global sums" accumulated from every data point in the input. For example, I need to store the sum of all the data values satisfying a certain condition, and the number of data points that satisfy it. The kernel is below to make it clearer:

__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;

    if (x < s) {
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if (dist < aperture) {
            output[0] += stamp[x] - dark[x];
            output[1]++;
        } else if (dist > sky_inner && dist < sky_outer) {
            output[2] += stamp[x] - dark[x];
            output[3]++;
        }
    }
}

All the values not declared in the kernel (n, centerX, centerY, aperture, sky_inner, sky_outer) are defined beforehand by macros. s is the length of the input arrays stamp and dark, which are n×n matrices flattened down to 1D.

I get results, but they differ from my CPU version of this. Of course I am wondering: is this the right way to do what I'm trying to do? Can I be sure that each pixel's data is added exactly once? I can't think of any other way to save the cumulative result values.

Upvotes: 1

Views: 1586

Answers (2)

erre4

Reputation: 23

You can do this concurrently in O(log2(n)) steps. A concept sketch:

You have 16 inputs (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) and you want their sum, computed concurrently.

You can concurrently add element 1 into 2, 3 into 4, 5 into 6, 7 into 8, 9 into 10, 11 into 12, 13 into 14, 15 into 16;

then, concurrently, 2 into 4, 6 into 8, 10 into 12, 14 into 16;

then, again concurrently, 4 into 8, 12 into 16;

and finally 8 into 16.

Everything is done in O(log2(n)) steps, in our case 4 passes. A sketch of this in OpenCL follows.
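In OpenCL this pairwise scheme is usually done per work group in local memory. A minimal sketch, assuming OpenCL 1.1+, a power-of-two work-group size, and a hypothetical reduce_sum kernel whose per-group partial sums are added together afterwards on the host (or by a second kernel pass):

__kernel void reduce_sum(__global const float* in,
                         __global float* partial,
                         __local float* scratch,
                         const uint count)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    // Load one element per work item (0 if past the end of the data).
    scratch[lid] = (gid < count) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Pairwise tree reduction: halve the active stride each pass,
    // so log2(group size) passes in total.
    for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One partial sum per work group.
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}

For the four outputs in the question you could carry four accumulators through the loop (or run the reduction once per output), with each work item contributing either its masked value or 0 depending on the distance test.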

Upvotes: 0

Robert Wang

Reputation: 1221

Atomic operations are needed in your case; otherwise data races will make the results unpredictable.

The problem is here:

output[0] += stamp[x]-dark[x];
output[1]++;

You can imagine that threads in the same wave execute in lockstep, so it might be OK for work items inside the same wave: they read the same output[0] value using one global load instruction (a broadcast), and when they finish the computation and try to store into the same memory address (output[0]), the write operations are serialized. Up to this point, you may still get correct results for the work items inside the same wave.

However, your program very likely launches more than one wave (in most applications, this is the case). Different waves may execute in an unknown order, and when they access the same memory address the behavior becomes more complicated. For example, wave0 and wave1 may both read output[0] at the beginning, before any other computation happens, which means they fetch the same value; then they each do their computation and store their accumulated results back into output[0]. The result from one wave is overwritten by the other, as if only the wave that wrote later had executed at all. Now imagine many more waves in a real application, and it is not strange to get a wrong result.
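As a sketch of what the atomic version could look like: on OpenCL 1.x devices atomic_add only works on integers, so a common workaround for the float sums is a compare-and-swap loop built on atomic_cmpxchg (the helper name atomic_add_float below is my own, not a built-in; this assumes OpenCL 1.1+):

// Emulated atomic float add: retry the compare-and-swap until no
// other work item has modified *addr between our read and our write.
inline void atomic_add_float(volatile __global float* addr, float val)
{
    union { uint u; float f; } prev, next;
    do {
        prev.f = *addr;
        next.f = prev.f + val;
    } while (atomic_cmpxchg((volatile __global uint*)addr,
                            prev.u, next.u) != prev.u);
}

__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         volatile __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;

    if (x < s) {
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if (dist < aperture) {
            atomic_add_float(&output[0], stamp[x] - dark[x]);
            atomic_add_float(&output[1], 1.0f);   // count kept as a float
        } else if (dist > sky_inner && dist < sky_outer) {
            atomic_add_float(&output[2], stamp[x] - dark[x]);
            atomic_add_float(&output[3], 1.0f);
        }
    }
}

Remember to zero the output buffer before each launch. Atomics make every update visible exactly once, but they serialize contending work items, so for large inputs the tree reduction from the other answer scales better.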

Upvotes: 3
