Reputation: 1144
I need to run several C++11 threads (GCC 4.7.1) in parallel on the host. Each of them needs to use a device, say a GPU. As per the OpenCL 1.2 spec (p. 357):
All OpenCL API calls are thread-safe except clSetKernelArg.
clSetKernelArg is safe to call from any host thread, and is safe
to call re-entrantly so long as concurrent calls operate on different
cl_kernel objects. However, the behavior of the cl_kernel object is
undefined if clSetKernelArg is called from multiple host threads on
the same cl_kernel object at the same time.
An elegant way would be to use thread_local cl_kernel objects, and the other way I can think of is to use an array of these objects such that the i-th thread uses the i-th object. As I have not implemented either of these before, I was wondering whether either of the two is a good approach, or whether there are better ways of getting things done.
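Here is a minimal sketch of what I have in mind for the thread_local variant, just to make the idea concrete. The context/program/queue/buffer setup is omitted, "my_kernel" and the per-thread buffers are placeholders, and error checking is left out. (I understand the C++11 thread_local keyword may not be fully supported before GCC 4.8, so this might need __thread or a lazy-initialization workaround on GCC 4.7.1.)

```cpp
#include <CL/cl.h>
#include <thread>
#include <vector>

// Assumed to be created during initialization (clBuildProgram,
// clCreateCommandQueue, clCreateBuffer); the setup itself is omitted here.
cl_program       program = nullptr;
cl_command_queue queues[4];
cl_mem           buffers[4];

void worker(int tid)
{
    // Each thread lazily creates its own kernel object from the shared program;
    // no other thread ever touches this handle, so clSetKernelArg is safe.
    thread_local cl_kernel kernel = clCreateKernel(program, "my_kernel", nullptr);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffers[tid]);

    size_t global = 1024;
    clEnqueueNDRangeKernel(queues[tid], kernel, 1, nullptr, &global, nullptr,
                           0, nullptr, nullptr);
    clFinish(queues[tid]);
    // clReleaseKernel left out for brevity (the handle lives as long as the thread).
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);
    for (auto& t : threads)
        t.join();
}
```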
A third way would perhaps be to use a mutex for a single cl_kernel object and associate it with an event handler; the thread could then wait until the event has finished. I am not sure whether this works in a multi-threaded situation, though...
Upvotes: 2
Views: 706
Reputation: 1210
The main question is whether all these threads need to use the same kernel or whether each one gets its own distinct kernel. Your idea to use either thread_local cl_kernel objects or an array of n kernel objects both result in n kernel objects being created, and both are equally fine from OpenCL's perspective. If they all contain the same code, though, then you unnecessarily waste space, cause context switches, mess up caching, and so on; it would be comparable to loading an application binary into memory multiple times without sharing the constant binary code segments.
If you actually want to use the same kernel from within multiple threads, then I'd suggest performing manual synchronization on a single cl_kernel object. If you don't want your threads to block waiting until other threads have completed their work, you can use asynchronous command queuing and events to get notified once the work of a particular thread is done (to prevent a thread from queuing work faster than the GPU can process it, or to read back results, of course).
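For illustration, a minimal sketch of that idea could look like the following. It assumes the shared kernel and the command queue have already been created (setup omitted), and the buffers are placeholders. The important point is that clSetKernelArg and clEnqueueNDRangeKernel happen under the same lock, while waiting on the event happens outside of it.

```cpp
#include <CL/cl.h>
#include <mutex>
#include <thread>
#include <vector>

cl_kernel        shared_kernel = nullptr;  // created once with clCreateKernel
cl_command_queue queue         = nullptr;  // created once with clCreateCommandQueue
cl_mem           buffers[4];               // per-thread buffers (setup omitted)
std::mutex       kernel_mutex;             // guards the shared cl_kernel

void worker(int tid)
{
    cl_event done = nullptr;
    {
        // Argument setting and enqueueing must happen under the same lock,
        // otherwise another thread could overwrite the arguments in between.
        std::lock_guard<std::mutex> lock(kernel_mutex);
        clSetKernelArg(shared_kernel, 0, sizeof(cl_mem), &buffers[tid]);

        size_t global = 1024;
        clEnqueueNDRangeKernel(queue, shared_kernel, 1, nullptr, &global,
                               nullptr, 0, nullptr, &done);
        clFlush(queue);  // make sure the work is submitted before unlocking
    }
    // Waiting (or registering a callback via clSetEventCallback) happens outside
    // the lock, so other threads are not blocked while the GPU runs.
    clWaitForEvents(1, &done);
    clReleaseEvent(done);
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);
    for (auto& t : threads)
        t.join();
}
```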
If your threads are to execute different kernel programs, then I suggest creating a separate command queue per thread to simplify execution. It is then entirely up to you whether you store these object handles in thread-local storage, in a global array, or elsewhere.
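A minimal sketch of that setup, again with the context/device/program creation omitted and placeholder kernel names:

```cpp
#include <CL/cl.h>
#include <string>
#include <thread>
#include <vector>

cl_context   context = nullptr;  // created with clCreateContext in real code
cl_device_id device  = nullptr;  // chosen with clGetDeviceIDs in real code
cl_program   program = nullptr;  // built with clBuildProgram in real code

void worker(const std::string& kernel_name)
{
    // Each thread owns its kernel and its command queue; nothing is shared
    // between threads, so no locking is required.
    cl_command_queue queue  = clCreateCommandQueue(context, device, 0, nullptr);
    cl_kernel        kernel = clCreateKernel(program, kernel_name.c_str(), nullptr);

    size_t global = 1024;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                           0, nullptr, nullptr);
    clFinish(queue);

    clReleaseKernel(kernel);
    clReleaseCommandQueue(queue);
}

int main()
{
    std::vector<std::thread> threads;
    threads.emplace_back(worker, "kernel_a");  // placeholder kernel names
    threads.emplace_back(worker, "kernel_b");
    for (auto& t : threads)
        t.join();
}
```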
Upvotes: 3