amnl

Reputation: 71

Instruction transfer between CPU and GPU

I'm looking for information on how the CPU moves program code to the GPU when working with GPGPU computation. The Internet is full of manuals about data transfer, but not about instruction/program loading.

The question is: the program is handled by the CPU, which "configures" the GPU with the adequate flags on each computing unit to perform a given operation. After that, data is transferred and processed. How is this first step done? How are instructions issued to the GPU? Are the instructions somehow packed to take advantage of the bus bandwidth? I may have overlooked something fundamental, so any additional information is welcome.

Upvotes: 4

Views: 1367

Answers (1)

aland

Reputation: 5154

There is indeed not much information about it, but you overestimate the cost of this transfer.

The whole kernel code is loaded onto the GPU only once (at worst once per kernel invocation, but it looks like it is actually once per application run, see below), and is then executed entirely on the GPU without any intervention from the CPU. So the whole kernel code is copied in one chunk at some point before the kernel invocation. To estimate the code size: the .cubin holding all the GPU code of our home-made MD package (52 kernels, some of which are > 150 lines of code) is only 91 KiB, so it's safe to assume that in pretty much all cases the code transfer time is negligible.

Here is what I've found in the official docs:

With the CUDA Driver API, the code is loaded onto the device at the time you call the cuModuleLoad function:

The CUDA driver API does not attempt to lazily allocate the resources needed by a module; if the memory for functions and data (constant and global) needed by the module cannot be allocated, cuModuleLoad() fails

Theoretically, you might have to unload a module and then load it again if you have several modules that use too much constant (or statically allocated global) memory to be loaded simultaneously, but that is quite uncommon, and you usually call cuModuleLoad only once per application launch, right after context creation.
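To make that timing concrete, here is a minimal Driver API sketch (error checking omitted; the module file kernels.cubin and the kernel name my_kernel are placeholder names, and the kernel is assumed to take no arguments). The code transfer happens once, at cuModuleLoad; every later launch sends only the launch configuration:

    #include <cuda.h>

    int main(void)
    {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* The whole compiled module is copied to the device here,
           once, right after context creation. */
        cuModuleLoad(&mod, "kernels.cubin");
        cuModuleGetFunction(&fn, mod, "my_kernel");

        /* Later launches send only the launch configuration and
           arguments, not code. */
        cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }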

The CUDA Runtime API does not provide any means of controlling module loading/unloading, but it looks like all the necessary code is loaded onto the device during its initialization.
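For comparison, a minimal Runtime API sketch (the trivial kernel scale is made up, error checking omitted). There is no explicit load call at all, which is consistent with the kernel image embedded in the executable being made resident around implicit context initialization:

    #include <cuda_runtime.h>

    __global__ void scale(float *x) { x[threadIdx.x] *= 2.0f; }

    int main(void)
    {
        float *d;

        /* The first runtime call implicitly initializes the context;
           the embedded kernel code appears to be loaded around this
           point, with no explicit load call in user code. */
        cudaMalloc(&d, 32 * sizeof(float));

        /* The launch transfers only configuration and arguments. */
        scale<<<1, 32>>>(d);
        cudaDeviceSynchronize();

        cudaFree(d);
        return 0;
    }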

The OpenCL specs are not as specific as the CUDA Driver API documentation, but the code is likely (wild guessing involved) copied to the device at the clBuildProgram stage.
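If that guess is right, then in a minimal host-side sketch like the following (error checking omitted; the kernel and variable names are made up), the binary produced by clBuildProgram is what ends up on the device, and later clEnqueueNDRangeKernel calls only issue launch commands:

    #include <CL/cl.h>

    static const char *src =
        "__kernel void scale(__global float *x)"
        "{ x[get_global_id(0)] *= 2.0f; }";

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        cl_int err;

        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        /* Compilation (and, presumably, the transfer of the resulting
           binary to the device) happens here, not at each launch. */
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", &err);

        /* ... set arguments and enqueue the kernel as usual ... */

        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseContext(ctx);
        return 0;
    }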

Upvotes: 3
