OpenCL: multiple work items saving results to the same global memory address

I'm trying to do a reduce-like cumulative calculation where 4 different values need to be stored, depending on certain conditions. My kernel receives long arrays as input and needs to store only 4 values, which are "global sums" accumulated from every data point in the input. For example, I need to store the sum of all the data values satisfying a certain condition, and the number of data points that satisfy it. The kernel is below to make it clearer:

__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;

    if (x < s) {
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if (dist < aperture) {
            output[0] += stamp[x] - dark[x];
            output[1]++;
        } else if (dist > sky_inner && dist < sky_outer) {
            output[2] += stamp[x] - dark[x];
            output[3]++;
        }
    }
}

All the values not declared in the kernel (n, centerX, centerY, aperture, sky_inner, sky_outer) are defined beforehand by macros. s is the length of the input arrays stamp and dark, which are n×n matrices flattened down to 1D.

I get results, but they differ from my CPU version of this. Of course I am wondering: is this the right way to do what I'm trying to do? Can I be sure that each pixel's data is added exactly once? I can't think of any other way to save the cumulative result values.

Upvotes: 1

Views: 1586

Answers (2)

erre4

Reputation: 23

You can do this concurrently in O(log2(n)) steps. A concept sketch:

You have 16 inputs (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) and you want their sum, computed concurrently.

You can concurrently add element 1 into 2, 3 into 4, 5 into 6, 7 into 8, 9 into 10, 11 into 12, 13 into 14, 15 into 16;

then, concurrently, 2 into 4, 6 into 8, 10 into 12, 14 into 16;

then, again concurrently, 4 into 8, 12 into 16;

and finally 8 into 16.

Everything is done in O(log2(n)) steps, in our case 4 passes. A sketch of this in OpenCL follows.
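In OpenCL this pairwise scheme is usually done per work group in local memory. A minimal sketch, assuming OpenCL 1.1+, a power-of-two work-group size, and a hypothetical reduce_sum kernel whose per-group partial sums are added together afterwards on the host (or by a second kernel pass):

__kernel void reduce_sum(__global const float* in,
                         __global float* partial,
                         __local float* scratch,
                         const uint count)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    // Load one element per work item (0 if past the end of the data).
    scratch[lid] = (gid < count) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Pairwise tree reduction: halve the active stride each pass,
    // so log2(group size) passes in total.
    for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One partial sum per work group.
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}

For the four outputs in the question you could carry four accumulators through the loop (or run the reduction once per output), with each work item contributing either its masked value or 0 depending on the distance test.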

Upvotes: 0

Robert Wang

Reputation: 1221

Atomic operations are needed in your case; otherwise data races will make the results unpredictable.

The problem is here:

output[0] += stamp[x]-dark[x];
output[1]++;

You can imagine that threads in the same wave execute in lockstep, so it might be OK for work items inside the same wave: they read the same output[0] value using one global load instruction (a broadcast), and when they finish the computation and try to store into the same memory address (output[0]), the write operations are serialized. Up to this point, you may still get correct results for the work items inside the same wave.

However, your program very likely launches more than one wave (in most applications, this is the case). Different waves may execute in an unknown order, and when they access the same memory address the behavior becomes more complicated. For example, wave0 and wave1 may both read output[0] at the beginning, before any other computation happens, which means they fetch the same value; then they each do their computation and store their accumulated results back into output[0]. The result from one wave is overwritten by the other, as if only the wave that wrote later had executed at all. Now imagine many more waves in a real application, and it is not strange to get a wrong result.
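As a sketch of what the atomic version could look like: on OpenCL 1.x devices atomic_add only works on integers, so a common workaround for the float sums is a compare-and-swap loop built on atomic_cmpxchg (the helper name atomic_add_float below is my own, not a built-in; this assumes OpenCL 1.1+):

// Emulated atomic float add: retry the compare-and-swap until no
// other work item has modified *addr between our read and our write.
inline void atomic_add_float(volatile __global float* addr, float val)
{
    union { uint u; float f; } prev, next;
    do {
        prev.f = *addr;
        next.f = prev.f + val;
    } while (atomic_cmpxchg((volatile __global uint*)addr,
                            prev.u, next.u) != prev.u);
}

__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         volatile __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;

    if (x < s) {
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if (dist < aperture) {
            atomic_add_float(&output[0], stamp[x] - dark[x]);
            atomic_add_float(&output[1], 1.0f);   // count kept as a float
        } else if (dist > sky_inner && dist < sky_outer) {
            atomic_add_float(&output[2], stamp[x] - dark[x]);
            atomic_add_float(&output[3], 1.0f);
        }
    }
}

Remember to zero the output buffer before each launch. Atomics make every update visible exactly once, but they serialize contending work items, so for large inputs the tree reduction from the other answer scales better.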

Upvotes: 3
