Madhav

Reputation: 104

What is the purpose of "clRetainKernel" function?

It is not possible to use the same kernel object to execute two instances of the same kernel in parallel.

To execute multiple instances of the same kernel in parallel, multiple kernel objects need to be created from the same program object and enqueued to different command queues.

Even if the host code is parallelized, there is no use for two CPU threads retaining the same kernel object. So what is the purpose of the "clRetainKernel" API?

Upvotes: 1

Views: 283

Answers (1)

huseyin tugrul buyukisik

Reputation: 11926

So what is the purpose of "clRetainKernel" API?

From the OpenCL 1.2 specification: https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf

Page 18:

Reference Count: The life span of an OpenCL object is determined by its reference count—an internal count of the number of references to the object. When you create an object in OpenCL, its reference count is set to one. Subsequent calls to the appropriate retain API (such as clRetainContext, clRetainCommandQueue) increment the reference count. Calls to the appropriate release API (such as clReleaseContext, clReleaseCommandQueue) decrement the reference count. After the reference count reaches zero, the object's resources are deallocated by OpenCL.

It increments the internal reference counter of the relevant OpenCL object, so the object stays usable outside of some RAII block. I didn't use it because RAII was enough for me. But if there is a "sharing" issue, retaining helps keep the object alive outside its original scope. So everyone should do their own retaining and releasing when sharing anything out of its scope (especially if the C API is used instead of the C++ bindings). In the C++ bindings, https://github.khronos.org/OpenCL-CLHPP/cl2_8hpp_source.html#l05668 you can see that the constructor

explicit Kernel(const cl_kernel& kernel, bool retainObject = false) :  ...

takes ownership instead of incrementing the reference counter (retainObject = false). Then, elsewhere in the same header,

(with retain)

    // We must retain things we obtain from the API to avoid releasing
    // API-owned objects.
    if (devices) {
        devices->resize(ids.size());

        // Assign to param, constructing with retain behaviour
        // to correctly capture each underlying CL object
        for (size_type i = 0; i < ids.size(); i++) {
            (*devices)[i] = Device(ids[i], true); // true: retain
        }
    }

(with no-retain)

        kernels->resize(value.size());

        // Assign to param, constructing with retain behaviour
        // to correctly capture each underlying CL object
        for (size_type i = 0; i < value.size(); i++) {
            // We do not need to retain because this kernel is being created
            // by the runtime
            (*kernels)[i] = Kernel(value[i], false); // false: no retain
        }

which clearly says: if you created it, you don't need to retain it.

If it is an API-owned object, it will be released inside the API, so if you need to keep using it, retain it. If you created it yourself, you just create and release.

It is not possible to use the same kernel object to parallelly execute two instances of the same kernel.

No, it is possible, if you use a different offset on each ND-range launch.

cl_event evt;
clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, 100, myCharArray.data(), 0, NULL, &evt);

// first half of the buffer, on the first queue (in-order: runs after the write)
size_t global_work_size = 50;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);

// second half of the buffer, on the second queue, gated on the write's event
size_t global_work_size_2 = 50;
size_t global_offset_2 = 50;
cl_event evt2;
clEnqueueNDRangeKernel(queue2, kernel, 1, &global_offset_2, &global_work_size_2, NULL, 1, &evt, &evt2);

// read back once the second queue's kernel has signaled evt2
clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, 100, myCharArray.data(), 1, &evt2, NULL);

clFlush(queue);
clFlush(queue2);
clFinish(queue2);
clFinish(queue);

Make sure there is event synchronization between the queues, so each launch sees the latest bits of the data even though it executes with a different offset.

The second queue is synced with the first queue's data-copy command (the evt parameter). After the data is copied, the copy's event signals the other queue (queue2) so it can start computing. On the first queue, synchronization is implicit: enqueueing a compute right after the data copy without an event is fine because that queue is in-order. After queue2 completes its compute, it signals the readBuffer (via evt2).

This is a single-GPU sample; for multiple GPUs, you also need to copy the data between devices.

Even if the host code is parallelized, there is no use for two CPU threads

If synchronization is done with an event-polling spin-wait loop, the loop fully occupies its thread. If you have multiple command queues, each with its own spin-wait loop, then you need those two threads. You could also poll two events one after another within the same loop, but then you have to manage the event handling yourself when the number of command queues changes dynamically. With per-thread polling, the code scales more simply.

To parallelly execute multiple instances of the same kernel, multiple kernel objects need to be created

If the kernel is to be used on multiple GPUs concurrently, or on the same GPU but with different buffers, then there have to be different kernel objects, because setting kernel arguments is not an enqueue operation: it returns when it is done, it must not be called while the kernel is running, and you can't know the exact timing of a kernel run without waiting for its completion event. You could add a latch before kernel execution and set the arguments just in time from a callback, but that would be slow; having multiple kernel objects is both faster and simpler.

Upvotes: 3
