Reputation: 104
It is not possible to use the same kernel object to execute two instances of the same kernel in parallel.
To execute multiple instances of the same kernel in parallel, multiple kernel objects need to be created from the same program object and enqueued to different command queues.
Even if the host code is parallelized, there is no use for two CPU threads retaining the same kernel object. So what is the purpose of the clRetainKernel API?
Upvotes: 1
Views: 283
Reputation: 11926
So what is the purpose of "clRetainKernel" API?
From source https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf
Page 18:
Reference Count: The life span of an OpenCL object is determined by its reference count—an internal count of the number of references to the object. When you create an object in OpenCL, its reference count is set to one. Subsequent calls to the appropriate retain API (such as clRetainContext, clRetainCommandQueue) increment the reference count. Calls to the appropriate release API (such as clReleaseContext, clReleaseCommandQueue) decrement the reference count. After the reference count reaches zero, the object’s resources are deallocated by OpenCL.
It increments the internal reference counter of the corresponding OpenCL object, keeping the object usable outside of some RAII scope. I didn't use it because RAII was enough for me, but if there is a "sharing" issue, retaining keeps the object alive beyond the scope that created it. So everyone sharing an object outside its own scope should do their own retaining and releasing (especially if the C API is used instead of the C++ bindings). In the C++ bindings, https://github.khronos.org/OpenCL-CLHPP/cl2_8hpp_source.html#l05668 you can see that the constructor
explicit Kernel(const cl_kernel& kernel, bool retainObject = false) : ...
takes ownership by default instead of incrementing the reference counter (retainObject = false). Elsewhere in the same header you can see both behaviours:
(with retain, around line 2447 of the header)
// We must retain things we obtain from the API to avoid releasing
// API-owned objects.
if (devices) {
    devices->resize(ids.size());

    // Assign to param, constructing with retain behaviour
    // to correctly capture each underlying CL object
    for (size_type i = 0; i < ids.size(); i++) {
        (*devices)[i] = Device(ids[i], true); // true: retain
    }
}
(with no retain, around line 6457)
kernels->resize(value.size());

// Assign to param, constructing with retain behaviour
// to correctly capture each underlying CL object
for (size_type i = 0; i < value.size(); i++) {
    // We do not need to retain because this kernel is being created
    // by the runtime
    (*kernels)[i] = Kernel(value[i], false); // false: no retain
}
The comments clearly say: if you created it, you don't need to retain it.
If an object is API-owned, the API will release it, so if you need to keep using it, retain it first. If you created the object yourself, you just create it and release it when done.
It is not possible to use the same kernel object to execute two instances of the same kernel in parallel.
No, it is possible, if you use a different offset for each ND-range launch:
cl_event evt;
clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, 100, myCharArray.data(), 0, NULL, &evt);

// First half of the range on the first queue
size_t global_work_size = 50;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);

// Second half on the second queue, offset by 50, waiting on the copy
size_t global_work_size_2 = 50;
size_t global_offset_2 = 50;
cl_event evt2;
clEnqueueNDRangeKernel(queue2, kernel, 1, &global_offset_2, &global_work_size_2, NULL, 1, &evt, &evt2);

clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, 100, myCharArray.data(), 1, &evt2, NULL);

clFlush(queue);
clFlush(queue2);
clFinish(queue2);
clFinish(queue);
Make sure there is event synchronization between the queues, so that each launch sees the latest bits of the data while executing with its own offset.
The second queue is synchronized with the first queue's copy command through the evt parameter. After the data is copied, that event signals the other queue (queue2) so it can start computing. On the first queue, synchronization is implicit: enqueueing a compute right after the copy without an event is fine because that queue is in-order. After queue2 completes its compute, it signals the read-buffer command through evt2.
This is a single-GPU sample; for multiple GPUs you also need to copy the data between devices.
Even if the host code is parallelized, there is no use for two CPU threads
If synchronization is done with an event-polling spin-wait loop, that loop fully occupies its thread. If you have multiple command queues, each with its own spin-wait loop, then those two threads are needed. You can also poll two events one after another within the same loop, but then you have to take care of the event handling yourself when the number of command queues changes dynamically. With per-thread polling, the code scales more simply.
To execute multiple instances of the same kernel in parallel, multiple kernel objects need to be created
If a kernel is to be used on multiple GPUs concurrently, or on the same GPU but with different buffers, then there have to be different kernel objects, because setting kernel arguments is not an enqueue operation: clSetKernelArg returns when it is done, it must not be called while the kernel is running, and you can't know the exact time of a kernel run without waiting for its completion event. You could add a latch before the kernel execution and set the arguments just in time from a callback, but that would likely be slow, so having multiple kernel objects is both faster and simpler.
Upvotes: 3