Reputation: 11926
I have an OpenCL kernel batch class that enqueues more than 40 kernels, with a clFinish() between each kernel execution.
Question: When I comment out the clFinish() calls, the program runs 100% faster. Does this mean there is undefined behaviour between kernels when they read and write the same buffers? I suspect the performance increase comes from better utilisation of GPU resources/occupancy. Does using an in-order command queue guarantee a barrier between two kernels, or do I need a clFinish() between them?
It's a simple physics model and no difference is visible between the two versions.
Maybe I just need to use clFinish() at the end?
Edit: This command queue is not using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE.
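Roughly, the batch does this for each kernel (simplified sketch; the queue/kernel/size names are just placeholders, not the real class members):
for (int i = 0; i < numKernels; ++i) {
    clEnqueueNDRangeKernel(queue, kernels[i], 1, NULL,
                           &globalSize, &localSize, 0, NULL, NULL);
    clFinish(queue); // this per-kernel sync is what I comment out for the speedup
}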
Upvotes: 3
Views: 2808
Reputation: 1020
Perhaps the kernel isn't being run? Enqueuing a kernel is cheap and doesn't require any effort on the part of the host. However, calling clEnqueueNDRangeKernel != running the kernel.
//Start host timer
clEnqueueNDRangeKernel(queue, ...);
//Stop host timer - this doesn't actually time the kernel on the host
vs.
//Start host timer
clEnqueueNDRangeKernel(queue, ...);
clFinish(queue);
//Stop host timer - this does time the kernel on the host
Alternatively, you can use kernel profiling and events.
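For instance, a minimal sketch of event-based timing could look like this (assuming the usual <CL/cl.h> host setup - context, device, kernel, globalSize - already exists, and that the queue is created with CL_QUEUE_PROFILING_ENABLE):
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);              /* block until the kernel has finished */

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6); /* timestamps are in ns */
clReleaseEvent(evt);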
Upvotes: 2
Reputation: 1615
You do not need to use clFinish() between kernel calls if all kernels execute on the same command queue and that queue is in-order, which is exactly your scenario. So you don't need all those clFinish() calls; just keep the last one.
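In other words, a sketch of the suggested pattern (placeholder names for the queue and kernels, not your actual class):
// In-order queue: kernels execute in enqueue order, even if they
// read and write the same buffers, so no per-kernel sync is needed.
for (int i = 0; i < numKernels; ++i) {
    clEnqueueNDRangeKernel(queue, kernels[i], 1, NULL,
                           &globalSize, &localSize, 0, NULL, NULL);
}
clFinish(queue); // single sync at the end, when the host needs the results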
The performance difference comes from a combination of the following:
- It takes time for each clFinish() call to send all queued commands to the device and check their execution status.
- clFinish() calls also create breaks in the work being done by the GPU: the GPU has to wait for the next kernel to be enqueued and sent to the device after clFinish(). Omitting the calls to clFinish() means that kernel calls get sent to the GPU while previous kernels are still executing.
Upvotes: 7