How to correctly sum results from local to global memory in OpenCL

Question

I have an OpenCL kernel in which each workgroup produces a vector of results in local memory. I then need to sum all of these results into global memory for later retrieval to the host.
To test this, I created the following kernel code:

//1st thread in each workgroup initializes local buffer
if(get_local_id(0) == 0){
    for(i=0; i



In essence, I was expecting all elements of the vector in global memory to be equal to the number of workgroups (128 in my case). In reality they generally vary between 60 and 70, and the results change from run to run.

Can someone tell me what I'm missing, or how to do this correctly?

mfa · Accepted Answer

You can't synchronize between different work groups in OpenCL. A barrier with CLK_GLOBAL_MEM_FENCE does not work that way. It only guarantees that the order of global memory operations issued by the work-items of a single work group is maintained; it says nothing about work-items in other work groups. See section "6.12.8 Synchronization Functions" in the OpenCL 1.2 spec.
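To make the scope of a barrier concrete, here is a minimal sketch of what it does guarantee inside one work group. The kernel name, VEC_LEN, and the per_group_check buffer are hypothetical names chosen for illustration, not taken from the question's code.

```c
// Sketch only: the kernel name, VEC_LEN and per_group_check are hypothetical.
#define VEC_LEN 64

__kernel void barrier_scope_sketch(__global int *per_group_check)
{
    __local int scratch[VEC_LEN];
    const size_t lid   = get_local_id(0);
    const size_t lsize = get_local_size(0);

    // The work-items of this group zero the local buffer cooperatively.
    for (size_t i = lid; i < VEC_LEN; i += lsize)
        scratch[i] = 0;

    // Every work-item of THIS work group must arrive here before any of
    // them continues, and the local writes above are visible to the whole
    // group afterwards. Work-items in other work groups are not affected.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item adds 1 to every element; the atomic keeps concurrent
    // updates to the same element from being lost.
    for (size_t i = 0; i < VEC_LEN; ++i)
        atomic_inc(&scratch[i]);

    // Wait again so that all increments from this group are complete.
    barrier(CLK_LOCAL_MEM_FENCE);

    // One work-item records what its own group saw, in a slot that no
    // other group writes to.
    if (lid == 0)
        per_group_check[get_group_id(0)] = scratch[0];
}
```

Under these assumptions each group's slot ends up holding that group's local size. Nothing in the kernel lets one group wait for, or observe, another group's progress.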

I would solve your problem by using a different block of global memory for each work group. Each work group writes its data to its own block in global memory, and your kernel is finished. Then, if you want to reduce the data down to a single block, you can make another kernel to read the data from global memory and merge it with the other blocks of results. You can do as many layers of merging as you want, but the final merge has to be done by a single work group.
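The two-kernel approach described above could look roughly like the following. This is a sketch under assumptions, not the asker's actual code: the kernel names, VEC_LEN, and the partial and result buffers are hypothetical, and the per-group computation is replaced by a stand-in that sets every element to 1, mirroring the test in the question.

```c
// Sketch only: kernel names, VEC_LEN, partial and result are hypothetical.
#define VEC_LEN 64

// Pass 1: each work group writes its local result vector to its own block
// of `partial`, which must hold (number of work groups) * VEC_LEN ints.
__kernel void write_partials(__global int *partial)
{
    __local int local_result[VEC_LEN];
    const size_t lid   = get_local_id(0);
    const size_t lsize = get_local_size(0);

    // Stand-in for the real per-group computation: every element is 1.
    for (size_t i = lid; i < VEC_LEN; i += lsize)
        local_result[i] = 1;

    // In general, work-items read elements written by other work-items of
    // the same group, so wait until the local vector is complete.
    barrier(CLK_LOCAL_MEM_FENCE);

    // No other work group touches this block, so no cross-group
    // synchronization is needed.
    __global int *block = partial + get_group_id(0) * VEC_LEN;
    for (size_t i = lid; i < VEC_LEN; i += lsize)
        block[i] = local_result[i];
}

// Pass 2: enqueued after pass 1 has finished, here with a single work group.
// Each work-item sums one or more elements across all per-group blocks.
__kernel void merge_partials(__global const int *partial,
                             __global int *result,
                             const uint num_groups)
{
    for (size_t i = get_local_id(0); i < VEC_LEN; i += get_local_size(0)) {
        int sum = 0;
        for (uint g = 0; g < num_groups; ++g)
            sum += partial[g * VEC_LEN + i];
        result[i] = sum;
    }
}
```

On an in-order command queue, enqueueing merge_partials after write_partials is enough to guarantee that the first pass has completed before the second starts. With the 128 work groups from the question and this stand-in computation, every element of result would be 128.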

Search around for GPU/OpenCL reduction algorithms. Here's a decent one to start with: "Case Study: Simple Reductions".
