Reputation: 9292
I need to write an OpenCL program for reducing a large buffer (several million floats) into a single float. For the simplicity of the question I will suppose here that I need to compute the sum of all floats.
So I have written a kernel which takes a float buffer as input, and sums it by packets of 64. It writes the result to a buffer which is 64 times smaller. I then iterate the call of this kernel until the data is small enough to be copied back on the host and summed by the CPU.
I'm new to OpenCL, do I need to have a barrier between each kernel so that they are run sequentially, or is OpenCL smart enough to detect that the nth kernel pass is writing to an output buffer used as the input buffer of the n+1th kernel?
Or is there a smarter approach?
Upvotes: 1
Views: 569
Reputation: 9925
If you are using a single, in-order command queue for all of your kernel launches (i.e. you do not use the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
property), then each kernel invocation will run to completion before the next begins - you do not need any explicit barriers to enforce this behaviour.
If you are using an out-of-order command queue or multiple queues, you can enforce data dependencies via the use of OpenCL events. Each call to clEnqueueNDRangeKernel
can optionally return an event object, which can be passed to subsequent commands as dependencies.
Upvotes: 5