Reputation: 181
As is well known, the OpenCL barrier() function works only within a single work-group, and there is no direct way to synchronize across work-groups. If it is possible at all, what is the best approach to global synchronization today? Using atomics, OpenCL 2.0 features, etc.?
GitHub links and examples are welcome!
Thanks!
Upvotes: 4
Views: 6004
Reputation: 101
If a command_queue is configured for in-order processing, global synchronisation can be achieved through the ordering of sequential kernels. There is no explicit barrier() call, just kernel1 enqueued before kernel2: on an in-order queue, kernel1 completes all of its work before kernel2 starts. You will need a buffer shared between the two kernels to pass information between them.
In-order processing is the default. There is no need to call finish() between kernels.
If out-of-order execution is required, the command queue can be created with clCreateCommandQueueWithProperties, setting CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in the CL_QUEUE_PROPERTIES bitfield. In that case finish() (clFinish in the C API), or an event dependency between the two kernels, is required to ensure synchronisation.
Upvotes: 0
Reputation: 246
While global synchronization has no succinct in-kernel API call, if the compute device supports the OpenCL extension cl_khr_global_int32_base_atomics, it may be implemented using atomics.
Please see Xiao et al.'s paper that evaluates lock and lock-free approaches to global synchronization on GPUs. http://synergy.cs.vt.edu/pubs/papers/xiao-ipdps2010-gpusync.pdf
This is also mentioned in another Stack Overflow post: OpenCL and GPU global synchronization
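A rough sketch of what an atomics-based barrier across work-groups could look like in OpenCL C. This is a simplified illustration of the general counter idea, not the exact scheme from the paper. The names global_barrier, two_phase and arrived are made up. It uses the atomic_inc / atomic_add built-ins that are core since OpenCL 1.1 (on an OpenCL 1.0 device with the extension, the equivalents are atom_inc / atom_add). Two strong assumptions: every work-group of the launch is resident on the device at the same time (otherwise the spin loop never ends), and the host zeroes arrived before the launch. Note also that the OpenCL 1.x memory model only guarantees visibility of global writes across work-groups at kernel boundaries, so reading tmp after the barrier relies on implementation behaviour.

```c
/* Single-use barrier across all work-groups of a 1-D launch.
   `arrived` must be zeroed by the host before the kernel is enqueued.
   Deadlocks if not all work-groups are running concurrently. */
void global_barrier(volatile __global int *arrived)
{
    /* Every work-item of this group has reached the barrier. */
    barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

    if (get_local_id(0) == 0) {
        atomic_inc(arrived);                 /* announce this group */
        /* atomic_add with 0 is used as an atomic read. */
        while (atomic_add(arrived, 0) < (int)get_num_groups(0))
            ;                                /* spin until all groups arrive */
    }

    /* Hold the rest of the group until its leader leaves the spin loop. */
    barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
}

/* Hypothetical kernel using the barrier between two phases. */
__kernel void two_phase(__global const int *in,
                        volatile __global int *tmp,
                        __global int *out,
                        volatile __global int *arrived)
{
    size_t i = get_global_id(0);

    tmp[i] = in[i] * 2;                      /* phase 1 */
    global_barrier(arrived);
    out[i] = tmp[i] + tmp[(i + 1) % get_global_size(0)];   /* phase 2 */
}
```

In practice this means launching no more work-groups than the device can run at once, which is device-specific and something the OpenCL API does not report directly.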
Upvotes: 4
Reputation: 5087
Global synchronization within a kernel is not possible. This is because work-groups are not guaranteed to run at the same time. You can achieve a sort of global sync in the host application if you break your kernel into pieces. This is not suitable for many kernels, especially if you use a lot of local memory or have a bit of initialization code before your kernel does any real work, because local memory contents do not survive from one kernel launch to the next and the initialization has to be repeated in each piece.
Break your kernel into two parts, kernelA and kernelB for example. Global synchronization is then simply a matter of running the NDRange for kernelA, then finish(), then the NDRange for kernelB. The global data will remain in memory between the two calls.
Again, not pretty and not necessarily high performance, but if you really must have global sync, this is the only way to get it.
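To make the split concrete, here is a sketch of a sum reduction broken into two kernels in OpenCL C. The kernel bodies and the argument names (partial, scratch, num_groups, result) are made up for illustration. It assumes a 1-D launch where the work-group size is a power of two and the input length equals the global size. Only the partial buffer in global memory carries data across the kernel boundary; the local scratch array is gone once kernelA ends.

```c
/* kernelA: each work-group sums its slice of `in` in local memory and
   writes one partial sum to global memory.
   Assumes the work-group size is a power of two. */
__kernel void kernelA(__global const int *in,
                      __global int *partial,
                      __local int *scratch)
{
    size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction within the work-group. */
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}

/* kernelB: enqueued after kernelA has completed, with a global size of 1.
   Every partial sum is final by now, so a plain loop is enough. */
__kernel void kernelB(__global const int *partial,
                      int num_groups,
                      __global int *result)
{
    int sum = 0;
    for (int g = 0; g < num_groups; ++g)
        sum += partial[g];
    *result = sum;
}
```

On the host, num_groups would be the global size of the kernelA launch divided by its local size, and the scratch argument is set with clSetKernelArg using a size of local size times sizeof(cl_int) and a NULL pointer.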
Upvotes: 6