Reputation: 9092
The same OpenCL program is compiled for different OpenCL devices, possibly on different platforms, and a command queue is created for each device. So, for example, there could be two queues, one for the CPU and one for the GPU.
Is it possible to call clEnqueueNDRangeKernel and then a blocking clEnqueueReadBuffer on the two command queues from different host threads (one for each command queue)?
For example, using OpenMP with a loop like:
// queues_ contains command queues for different contexts,
// each with one device on one platform (e.g. CPU and GPU)
#pragma omp parallel for num_threads(2) schedule(dynamic)
for(int i = 0; i < job_count; ++i) {
cl::CommandQueue& queue = queues_[omp_get_thread_num()];
// queue is for one device on one platform
    // enqueue kernel, and read buffer on queue
}
This would divide the job list into two parts, one handled by the CPU and one by the GPU. schedule(dynamic)
makes the scheduling adapt dynamically to the execution times of the kernels: each thread grabs the next job as soon as its previous one finishes.
The host code would spend most of its time waiting for the kernels (in the blocking clEnqueueReadBuffer
call). But thanks to the CPU device, the CPU would actually be busy executing a kernel (in OpenCL) while at the same time waiting for the GPU to finish (in the host code).
Upvotes: 1
Views: 1170
Reputation: 11910
If the contexts are different too, then the two queues work independently, even alongside 3D applications. Depending on the implementation, the two contexts may be preempted or interleaved by the driver, but you can additionally add event-based synchronization between the contexts, so that an item in queue-a waits for the completion of an item in queue-b.
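Note that OpenCL events are tied to the context they were created in, so cross-context ordering has to go through the host: the host thread waits on the event from one context before enqueueing into the other. A hedged fragment (queue_a/queue_b, kernel_a/kernel_b and the work size are assumed to be set up elsewhere; error checking omitted):

```c
/* An event recorded in context B cannot appear in an event-wait list of an
 * enqueue in context A, so the host itself blocks on it. */
cl_event done_b;
clEnqueueNDRangeKernel(queue_b, kernel_b, 1, NULL, &gws, NULL, 0, NULL, &done_b);
clWaitForEvents(1, &done_b);   /* host waits for context B to finish the item */
clEnqueueNDRangeKernel(queue_a, kernel_a, 1, NULL, &gws, NULL, 0, NULL, NULL);
clReleaseEvent(done_b);
```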
If they live in the same context, you can synchronize the two queues directly with OpenCL events, which the driver can handle efficiently.
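Within one context, the dependency can be expressed directly in an event-wait list, with no host-side blocking. A minimal fragment, assuming queue_a and queue_b were created from the same context and kernel_a/kernel_b are already built:

```c
/* Same context: the event from queue_b goes straight into the wait list of
 * the enqueue on queue_a; the driver enforces the ordering. */
cl_event done_b;
size_t gws = 1024;  /* assumed global work size */
clEnqueueNDRangeKernel(queue_b, kernel_b, 1, NULL, &gws, NULL, 0, NULL, &done_b);
clEnqueueNDRangeKernel(queue_a, kernel_a, 1, NULL, &gws, NULL, 1, &done_b, NULL);
clReleaseEvent(done_b);
```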
Using all CPU cores for a memory-bound kernel leaves no headroom for copying arrays to and from the GPU, unless the copies are done by direct memory access (DMA), which frees the CPU from executing the copy instructions itself. If the cache is big and fast enough, this may not be necessary.
Upvotes: 1