huseyin tugrul buyukisik

Reputation: 11926

Commenting out clFinish() makes program 100% faster

I have an OpenCL kernel batch class that enqueues more than 40 kernels, with a clFinish() between each kernel execution.

Question: When I comment out the clFinish() calls, the program runs 100% faster. Does this mean there is undefined behaviour between kernels when they read and write the same buffers? I suspect the performance increase comes from better utilisation of GPU resources (occupancy). Does an in-order command queue guarantee a barrier between two kernels, or do I need a clFinish() between them?

It's a simple physics model, and no difference is visible between the two versions.

Maybe I just need to use clFinish() at the end?

Edit: This command queue is not using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE.

Upvotes: 3

Views: 2808

Answers (2)

Austin

Reputation: 1020

Perhaps the kernel isn't actually being run inside the timed region? Enqueuing a kernel is cheap and requires very little work on the part of the host. However, calling clEnqueueNDRangeKernel is not the same as running the kernel.

// Start host timer
clEnqueueNDRangeKernel(...);
// Stop host timer - this only times the enqueue call, not the kernel's execution

vs.

// Start host timer
clEnqueueNDRangeKernel(...);
clFinish(queue);  // blocks until every command queued so far has completed
// Stop host timer - this does include the kernel's execution time

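Alternatively, you can time the kernel on the device with event profiling: create the command queue with CL_QUEUE_PROFILING_ENABLE and query the kernel's event with clGetEventProfilingInfo.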

Upvotes: 2

chippies

Reputation: 1615

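You do not need to call clFinish between kernel calls if all kernels are enqueued on the same command queue and that queue is in-order. In an in-order queue each command starts only after the previous one has completed, so a later kernel sees the buffer writes of an earlier one. That is exactly your scenario, so you don't need all those clFinish calls; just keep the last one.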

The performance difference comes from a combination of the following:

  • There is overhead in each clFinish call to send all queued commands to the device and check their execution status.
  • These clFinish calls also create breaks in the work being done by the GPU - the GPU has to wait for the next kernel to be enqueued and sent to the device after clFinish. Omitting the calls to clFinish means that kernel calls get sent to the GPU while previous kernels are still executing.
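
To make the two points above concrete, here is a toy timeline model in C. It is only a sketch and not a measurement: the `Costs` struct, both function names, and every number passed in are hypothetical, and a real driver may batch or overlap work differently. It assumes one in-order queue, a fixed host cost to submit each kernel, a fixed device time per kernel, and a fixed round-trip cost per clFinish.

```c
#include <assert.h>

/* Hypothetical cost inputs, in microseconds. None of these are measured. */
typedef struct {
    double submit_us; /* host time to enqueue and send one kernel to the device */
    double kernel_us; /* device time to execute one kernel */
    double sync_us;   /* extra host cost of one clFinish round trip */
} Costs;

/* clFinish after every kernel: the host waits for kernel i to complete
 * before it submits kernel i+1, so all three costs add up serially. */
double total_with_finish_each(int n, Costs c) {
    assert(n >= 0);
    return n * (c.submit_us + c.kernel_us + c.sync_us);
}

/* One clFinish at the end: the host keeps submitting while the device works.
 * In an in-order queue, kernel i starts once it has been submitted AND
 * kernel i-1 has finished, whichever happens later. */
double total_with_finish_last(int n, Costs c) {
    assert(n >= 0);
    double host = 0.0;   /* time at which the host finished submitting kernel i */
    double device = 0.0; /* time at which the device finished kernel i-1 */
    for (int i = 0; i < n; i++) {
        host += c.submit_us;
        double start = (host > device) ? host : device;
        device = start + c.kernel_us;
    }
    return device + c.sync_us; /* the single clFinish at the end */
}
```

With the made-up inputs `{20.0, 100.0, 30.0}` and 40 kernels, the model gives 6000 µs with a clFinish after every kernel and 4050 µs with a single clFinish at the end. The gap comes entirely from the submit and sync costs that are no longer serialized with kernel execution, so the speed-up you actually see depends on how large those costs are relative to your kernel times.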

Upvotes: 7
