Emanuel Ey
Emanuel Ey

Reputation: 2854

How to correclty sum results from local to global memory in OpenCL

I have an OpenCL kernel in which each workgroup produces a vector of results in local memory. I then need to sum all of these results into global memory for later retrieval to the host.
To test this, i created the following kernel code:

//1st thread in each workgroup initializes local buffer
if(get_local_id(0) == 0){
    for(i=0; i<HYD_DIM; i++){
        pressure_Local[i] = (float2){1.0f, 0.0f};
    }
}

//wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

/// sum all the results into global storage
for(i=0; i<get_num_groups(0); i++){

    //1st thread in each workgroup writes the group's local buffer to global memory
    if(i == get_group_id(0) && get_local_id(0) == 0){
        for(j=0; j<HYD_DIM; j++){
            pressure_Global[j] += pressure_Local[j];
            // barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }

    //flush global memory buffers:
    barrier(CLK_GLOBAL_MEM_FENCE);
}

In essence, I was expecting all elements of the vector in global memory to be equal to the number of workgroups (128 in my case). In reality they generally vary between 60 and 70, and the results change from run to run.
Can someone tell me what it is that i'm missing, or how to do this correctly?

Upvotes: 1

Views: 887

Answers (1)

mfa
mfa

Reputation: 5087

You can't synchronize between different work groups with opencl. CLK_GLOBAL_MEM_FENCE does not work that way. It only guarantees that the order of memory operations (accessed by the work group) will be maintained. See section "6.12.8 Synchronization Functions" in the OCL 1.2 spec.

I would solve your problem by using a different block of global memory for each work group. You write the data to global, and your kernel is finished. Then, if you want to reduce the data down to a single block, you can make another kernel to read the data from global, and merge it with the other blocks of results. You can do as many layers of merging as you want, but the final merge has to be done by a single work group.

Search around for gpu/opencl reduction algorithms. Here's a decent one to start with. Case Study: Simple Reductions

Upvotes: 2

Related Questions