benshope

Reputation: 3024

OpenCL clEnqueueReadBuffer During Kernel Execution?

Can queued kernels continue to execute while an OpenCL clEnqueueReadBuffer operation is occurring?

In other words, is clEnqueueReadBuffer a blocking operation on the device?

Upvotes: 1

Views: 2451

Answers (1)

Dithermaster

Reputation: 6333

From a host API point of view, clEnqueueReadBuffer can be blocking or not, depending on whether you set the blocking_read parameter to CL_TRUE or CL_FALSE.

If you set it to not block, then the read just gets queued and you should use an event (or subsequent blocking call) to determine when it has finished (i.e., before you access the memory that you are reading to).
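For example, here is a minimal sketch of the non-blocking pattern, assuming CL/cl.h is included and that queue and buffer were created earlier (the names are placeholders, not anything from the question):

    // Non-blocking read: the call returns right away, so don't touch
    // host_data until the event says the read has finished.
    float host_data[1024];
    cl_event read_done;

    cl_int err = clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0,
                                     sizeof(host_data), host_data,
                                     0, NULL, &read_done);
    // ...enqueue more work or do host-side work here...
    clWaitForEvents(1, &read_done);   // now host_data is safe to use
    clReleaseEvent(read_done);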

If you set it to block, the call won't return until the read is done. The memory being read into will be correct. Also (and answering your actual question) any operations you queued prior to the clEnqueueReadBuffer will all have to finish before the read starts (see exception note below).
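To make that concrete, a sketch with an in-order queue (queue, kernel, and buffer again assumed to exist; error checking omitted):

    // In-order queue: the kernel must finish before the read starts, and
    // the blocking read doesn't return until host_data holds the result.
    size_t global_size = 1024;
    float host_data[1024];

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
    clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0,
                        sizeof(host_data), host_data,
                        0, NULL, NULL);
    // host_data now contains the kernel's output.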

All clEnqueue* API calls are asynchronous, but some have "blocking" parameters you can set. Using one is equivalent to using the non-blocking version and then calling clFinish. The command queue will be flushed to the device and your host thread won't continue until the work has finished. Of course, it is hard to keep the GPU always busy this way, since after the wait it has no queued work, but if you queue up new work fast enough you can still keep it reasonably busy.

This all assumes a single, in-order command queue. If your command queue is out-of-order and your device supports out-of-order queues then enqueued items can execute in any order that doesn't violate the event_wait_list parameters you provided. Likewise, you can have multiple command queues, which can again be executed in any order that doesn't violate the event_wait_list parameters you provided. Typically, they are used to overlap memory transfers and compute, and to keep multiple compute units busy. Out-of-order command queues and multiple command queues are both advanced OpenCL concepts and shouldn't be attempted until you fully understand and have experience with in-order command queues.
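Roughly, reusing the placeholder names from the sketches above (and the OpenCL 1.x clCreateCommandQueue call), an out-of-order queue looks like this:

    // Out-of-order queue: commands may run in any order unless events say
    // otherwise, so the read explicitly waits on both kernels.
    cl_int err;
    cl_command_queue ooq = clCreateCommandQueue(
        context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    cl_event kernels_done[2];
    clEnqueueNDRangeKernel(ooq, kernelA, 1, NULL, &global_size, NULL,
                           0, NULL, &kernels_done[0]);
    clEnqueueNDRangeKernel(ooq, kernelB, 1, NULL, &global_size, NULL,
                           0, NULL, &kernels_done[1]);
    // kernelA and kernelB may overlap; the blocking read waits on both.
    clEnqueueReadBuffer(ooq, buffer, CL_TRUE, 0, sizeof(host_data), host_data,
                        2, kernels_done, NULL);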

Clarification added later after DarkZeros pointed out the "on the device" part of the OP's question: My answer was from the host thread API point of view. On the device, with an in-order command queue all downstream commands are blocked by the current command. With an out-of-order queue they are only blocked by the event_wait_list. However, out-of-order command queues are not well supported in today's drivers. With multiple command queues, in theory commands are only blocked by prior commands (if in-order) and the event_wait_list. In reality, there are sometimes vendor-specific rules that prevent the free flow of otherwise non-blocked commands. This is often because the multiple OpenCL command queues get transferred to device-side memory and compute queues, and get executed in-order there. So depending on the order that you add commands to your multiple command queues, they might get interleaved in such a way that they block in sub-optimal ways. The best solution I'm aware of is either to be careful about the order you enqueue (based on knowledge of this implementation detail), or to use one queue for memory and one for compute, which matches the device-side queueing.
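Here's a hedged sketch of that last approach: one in-order queue for transfers and one for compute, with events enforcing the ordering (in_buf, out_buf, host_in, host_out, and bytes are placeholders; error checking and event releases are omitted):

    // Two in-order queues on the same device: one for transfers, one for
    // compute. Events enforce write -> kernel -> read ordering while still
    // letting the driver overlap transfers with compute.
    cl_int err;
    cl_command_queue xfer_q = clCreateCommandQueue(context, device, 0, &err);
    cl_command_queue comp_q = clCreateCommandQueue(context, device, 0, &err);

    cl_event write_done, kernel_done;
    clEnqueueWriteBuffer(xfer_q, in_buf, CL_FALSE, 0, bytes, host_in,
                         0, NULL, &write_done);
    clFlush(xfer_q);    // so comp_q can wait on write_done
    clEnqueueNDRangeKernel(comp_q, kernel, 1, NULL, &global_size, NULL,
                           1, &write_done, &kernel_done);
    clFlush(comp_q);    // so xfer_q can wait on kernel_done
    clEnqueueReadBuffer(xfer_q, out_buf, CL_FALSE, 0, bytes, host_out,
                        1, &kernel_done, NULL);
    clFinish(xfer_q);   // host_out is valid once this returns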

If overlap of memory and compute is your goal, AMD and NVIDIA both provide examples of how to overlap memory and compute operations, and, for GPUs that support multiple concurrent compute operations, how to do that too. The NVIDIA examples are harder to get ahold of, but they are out there (from the CUDA 4 days).

Upvotes: 6
