lanza

Reputation: 1642

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set up the cl_mem buffers for the arguments and then use clSetKernelArg to bind the two together.

You then enqueue the kernel to run and read back the buffer.
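
For concreteness, this is roughly what I mean on the host side (error handling omitted; the context/queue/program setup and the kernel name are just illustrative):

```c
/* Minimal sketch of the flow described above: buffer -> arg -> enqueue -> read back.
   Assumes ctx, queue, and program were created earlier; error handling omitted. */
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

void run_square(cl_context ctx, cl_command_queue queue, cl_program program,
                float *host_data, size_t n)
{
    cl_int err;

    /* Device-side buffer backing the kernel argument */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), host_data, &err);

    /* Bind the buffer to the kernel argument with clSetKernelArg */
    cl_kernel kernel = clCreateKernel(program, "square", &err); /* kernel name is illustrative */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Enqueue the kernel, then read the result back into host memory */
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_data,
                        0, NULL, NULL);

    clReleaseKernel(kernel);
    clReleaseMemObject(buf);
}
```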

Now, how does the host program tell the GPU which instructions to run? For example, I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are executed in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_mem buffers be read by the GPU?

In fact, if you can point me to a detailed explanation of all of this, I'd be quite pleased. I'm a toolchain engineer, and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.

Upvotes: 1

Views: 641

Answers (2)

N0dGrand87

Reputation: 817

Enqueuing the kernel means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which adds a dispatch-compute-workload command to the command queue - cl_command_queue.

From the spec:

The command-queue can be used to queue a set of operations (referred to as commands) in order.

https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues

Next, the implementation of this API triggers the HW to process the commands recorded into a command queue (which holds the actual commands in the format that the particular HW understands). The HW might have several queues and process them in parallel. After the workload from a queue is processed, the HW informs the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which lets the user track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
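
For illustration, a minimal host-side sketch of dispatching work and then tracking it through the event mechanism (queue and kernel setup assumed; error handling omitted):

```c
/* Sketch: tracking a dispatched workload via the OpenCL event mechanism.
   Assumes queue and kernel are already set up; error handling omitted. */
#include <CL/cl.h>

void dispatch_and_wait(cl_command_queue queue, cl_kernel kernel, size_t global)
{
    cl_event done;

    /* The dispatch command is recorded into the command queue; `done`
       lets us observe when the HW has finished processing it. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &done);

    /* Blocks until the driver stack signals completion (typically an interrupt
       from the HW propagated up through the KMD to the runtime). */
    clWaitForEvents(1, &done);

    /* Optionally inspect the final execution status (CL_COMPLETE on success). */
    cl_int status;
    clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);

    clReleaseEvent(done);
}
```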

To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation, see: https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c

Upvotes: 0

pmdj

Reputation: 23438

It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.

A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary; this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel, referencing the VRAM address where that kernel is stored.
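
Since you're interested in the toolchain side: you can actually pull that GPU-specific binary back out from the host. A rough sketch using clGetProgramInfo (single-device program assumed; error handling omitted):

```c
/* Sketch: dumping the device-specific binary the driver's compiler produced.
   Assumes the program was built for exactly one device; error handling omitted. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

void dump_device_binary(cl_program program, const char *path)
{
    size_t bin_size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(bin_size), &bin_size, NULL);

    unsigned char *bin = malloc(bin_size);
    unsigned char *bins[1] = { bin };   /* one pointer per device */
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);

    /* On AMD hardware this blob is, roughly, an ELF wrapping the GPU ISA;
       vendor disassembly tools can be used to inspect it. */
    FILE *f = fopen(path, "wb");
    fwrite(bin, 1, bin_size, f);
    fclose(f);
    free(bin);
}
```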

With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:

  1. A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
  2. The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
  3. Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.

On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
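
For what it's worth, here is roughly how those strategies surface at the API level; the driver picks the actual backing scheme, and the creation flags are only hints (ctx and queue assumed; error handling omitted):

```c
/* Sketch: the same cl_mem API can be backed by any of the schemes above. */
#include <CL/cl.h>

void buffer_access_styles(cl_context ctx, cl_command_queue queue, size_t n)
{
    cl_int err;

    /* Likely VRAM-resident; CPU access then goes through explicit copies
       (styles 2/3 above, via clEnqueueRead/WriteBuffer). */
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n, NULL, &err);

    /* Hint that the buffer should live in host-accessible memory
       (style 1/2, or simply shared memory on UMA systems). */
    cl_mem host_buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                     n, NULL, &err);

    /* Mapping exposes the buffer in the CPU's address space, whether that ends
       up being a PCI BAR window, pinned system RAM, or a staging copy. */
    void *p = clEnqueueMapBuffer(queue, host_buf, CL_TRUE, CL_MAP_WRITE,
                                 0, n, 0, NULL, NULL, &err);
    /* ... fill p with data for the GPU ... */
    clEnqueueUnmapMemObject(queue, host_buf, p, 0, NULL, NULL);

    clReleaseMemObject(dev_buf);
    clReleaseMemObject(host_buf);
}
```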

If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.

Upvotes: 1
