What is the correct way in OpenCL to concatenate results of work-groups?

Question

Suppose that in an OpenCL kernel, each work-group outputs an unknown amount of data. Is there an efficient way to pack that output in global memory so that there are no holes in it?

Yann Vernier · Accepted Answer

One method is to use atomic_add() to reserve an index range in the output array, once you know how large a chunk your work-group requires. In OpenCL 1.0, this type of operation required an extension (cl_khr_global_int32_base_atomics). Atomic operations may be very slow, possibly going as far as locking the whole global memory bus (whose latency we tend to avoid like the plague), so you probably don't want to issue one per work-item. A downside to this scheme is that you don't know the order in which your results will be stored, as work-groups can (and will) execute out of order.
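To make the reservation idea concrete, here is a minimal host-side sketch in plain C11 rather than kernel code. It assumes a single shared counter (`cursor`, a hypothetical name) standing in for an integer in global memory, and uses `atomic_fetch_add` from `<stdatomic.h>` where a kernel would call atomic_add(). The helper names `reserve_chunk` and `emit_group` are made up for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model: `cursor` plays the role of a counter in global
   memory. The fetch-and-add returns the OLD value, which becomes the
   first index owned exclusively by the caller. */
unsigned reserve_chunk(atomic_uint *cursor, unsigned count)
{
    return atomic_fetch_add(cursor, count);
}

/* Emulates one work-group that produced `count` values: reserve once,
   then write contiguously starting at the reserved base. */
void emit_group(atomic_uint *cursor, int *out,
                const int *values, unsigned count)
{
    unsigned base = reserve_chunk(cursor, count); /* one atomic per group */
    for (unsigned i = 0; i < count; ++i)
        out[base + i] = values[i];
}
```

In a real kernel you would presumably have one work-item per group do the reservation and share the base index through local memory behind a barrier, so the atomic cost is paid once per group. The final value of the counter is the total number of elements written, and the chunks land in whatever order the groups happened to reserve them.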

Another approach is to simply not store the data contiguously, but to allocate a region large enough for each work-group. Once they finish, you can run a second pass to rearrange the data as required (most likely into a second buffer, as memmove-like tricks don't parallelize easily). If you're passing the data back to the CPU, feel free to enqueue all the clEnqueueReadBuffer calls in one batch before waiting for them to complete. A command queue created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE might help a bit; you can use the cl_event arguments to specify dependencies where they occur.
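The rearranging pass can be sketched like this, again as serial host-side C rather than a kernel. It assumes each work-group got a fixed-size region of `capacity` elements and recorded how many it actually filled in a `counts` array; both names, and the function `compact_slots`, are hypothetical. On the device the running offset would be an exclusive prefix sum over the counts, followed by a copy into the second buffer.

```c
#include <stddef.h>
#include <string.h>

/* Packs per-group regions into one dense buffer.
   `slots` holds num_groups regions of `capacity` ints each; counts[g]
   is the number of valid leading entries in region g.
   Returns the total number of elements copied into `dense`. */
size_t compact_slots(const int *slots, const size_t *counts,
                     size_t num_groups, size_t capacity, int *dense)
{
    size_t offset = 0; /* exclusive prefix sum of counts so far */
    for (size_t g = 0; g < num_groups; ++g) {
        memcpy(dense + offset, slots + g * capacity,
               counts[g] * sizeof *dense);
        offset += counts[g];
    }
    return offset;
}
```

Unlike the atomic scheme, this keeps the output in work-group order, at the cost of the extra buffer and the second pass.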
