Virux

Reputation: 158

Where is an ordinary variable defined inside a __device__ function placed?

In CUDA, I understand that a variable is placed in shared memory if it is defined with __shared__, and in constant memory if it is defined with __constant__. Also, memory allocated with cudaMalloc() is placed in GPU global memory. But where are variables without qualifiers such as __shared__, __constant__ or register placed? For example, the variable i as follows:

__device__ void func() {
    int i = 0;
    return;
}
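
For reference, here is a sketch of the qualified declarations I mentioned (a minimal example; the names are just for illustration):

__constant__ float coeffs[16];           // constant memory (device-wide)

__global__ void kernel(float *out) {     // out comes from cudaMalloc() -> global memory
    __shared__ float tile[256];          // shared memory, one copy per thread block
    tile[threadIdx.x] = coeffs[0];
    out[threadIdx.x] = tile[threadIdx.x];
}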

Upvotes: 1

Views: 261

Answers (3)

einpoklum

Reputation: 131808

Automatic variables, i.e. function-scope variables without a memory-space qualifier, are placed in one of the following locations:

  1. When optimized away:

    1.1 Nowhere - if the variable isn't actually necessary. This happens a lot, since CUDA functions are often inlined, with some variables becoming copies of a variable in the calling function, which can then be elided. (In the sketch after this list, note how the x from foo() is completely gone from the compiled bar().)

    1.2 Immediate values in the program's compiled code - if the variable's value is constant and never updated, its value may simply be "plugged" into the code as an immediate. (In the sketch below, two constant-valued variables are replaced with the constant which is their sum.)

  2. When not optimized away:

    2.1 Registers - if your variable can't be optimized away, the next best option is to keep it in a hardware register on (one of the streaming multiprocessors of) the GPU. This is the fastest option, and the one the compiler chooses whenever it can. (In the sketch below, the variables x and y are placed in registers %r1 and %r2.)

    2.2 'Local' memory - The 'local memory' of a CUDA thread is an area in global device memory which is (in principle) accessible only by that thread.
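
Since the original linked examples are not reproduced here, the following is a minimal sketch of cases 1.1, 1.2 and 2.1 (hypothetical code; compile with nvcc -ptx at full optimization and inspect the output to see the effects):

__device__ int foo() {
    int x = 5;         // 1.1: once foo() is inlined into bar(), x is gone entirely
    return x;
}

__device__ int bar() {
    int a = 2;         // 1.2: a and b are never updated, so the compiler can
    int b = 3;         // replace a + b with the immediate constant 5
    return foo() + a + b;   // typically compiles to something like "return 10"
}

__global__ void kernel(const int *in, int *out) {
    int x = in[threadIdx.x];   // 2.1: x and y live in PTX registers
    int y = x + bar();         // such as %r1 and %r2
    out[threadIdx.x] = y;
}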

Now, obviously, local memory is much slower to use. When will the compiler choose it, then?

The CUDA Programming Guide gives us the answer:

  1. When the automatic variable is too large to fit in the registers available to the current thread (depending on the GPU generation, each thread can use at most between 63 and 255 4-byte registers).
  2. When the automatic variable is an array, or has an array member, which is indexed with a non-constant offset - unfortunately, NVIDIA GPU multiprocessors don't support register indexing (see the sketch after this list).
  3. When the kernel's available quota of registers is already used up by other variables or by the compiled code - even if the variable itself is very small. Moving such values out to local memory is referred to as register spilling.
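
To illustrate case 2, here is a hedged sketch (a hypothetical kernel; you can verify the placement by looking for ld.local/st.local instructions in the generated PTX):

__global__ void kernel(const int *in, int *out, int n) {
    int buf[32];                          // automatic array, one copy per thread
    for (int k = 0; k < 32; ++k)
        buf[k] = in[threadIdx.x * 32 + k];
    // n is only known at runtime, so buf[] cannot be mapped onto registers
    // (no register indexing) and is typically placed in 'local' memory.
    out[threadIdx.x] = buf[n % 32];
}

Compiling with nvcc -Xptxas -v reports the per-thread local memory and spill statistics alongside the register count.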

Upvotes: 2

Jérôme Richard
Jérôme Richard

Reputation: 50623

GPUs have a large amount of space dedicated to registers, stored directly in the GPU computing units (i.e. Streaming Multiprocessors). Registers are not stored in memory unless some register spilling happens (typically when a kernel uses too many registers). Unlike memory bytes, registers have no address. The same is true for CPUs, except that the number of physical CPU registers is usually much smaller than on GPUs. For example, an Intel Skylake core has 180 integer registers and 168 vector registers, while the instruction set architecture is limited to something like 16 integer/vector registers. Note that in case of register spilling, the values of the spilled registers are temporarily stored in local memory (typically in the L1 cache if possible). Here is the overall memory hierarchy of a basic (Nvidia Fermi) GPU:

[figure: memory hierarchy of an Nvidia Fermi GPU]

For more information, consider reading: Local Memory and Register Spilling.
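
To make the spilling mechanism concrete, here is a hedged sketch (a hypothetical kernel; whether it actually spills depends on the GPU and compiler version). The __launch_bounds__ qualifier caps the per-thread register budget, and tightening it can force the compiler to spill values to local memory:

// Promising 1024 threads per block caps the registers available per
// thread (e.g. 65536 registers per SM / 1024 threads = 64 registers).
// If the compiled code needs more than that, the compiler spills the
// excess to local memory; nvcc -Xptxas -v reports the resulting
// "spill stores" and "spill loads".
__global__ void __launch_bounds__(1024) heavy_kernel(const float *in, float *out) {
    float acc[8] = {0.0f};                  // many live values raise
    for (int k = 0; k < 64; ++k)            // register pressure
        acc[k % 8] += in[threadIdx.x + k];
    out[threadIdx.x] = acc[0] + acc[7];
}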

Upvotes: 1

Abator Abetor

Reputation: 2598

Local variables are either placed in hardware registers or local memory (which is effectively global memory).

In your example, however, variable i will be removed by the compiler because it is unused.
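
A quick way to see this for yourself (assuming any recent nvcc): compile with nvcc -ptx and compare a dead variable to a live one, for example:

__device__ int sink;          // hypothetical global, only there to create a use

__device__ void func_unused() {
    int i = 0;                // unused: optimized away, leaves no trace in the PTX
}

__device__ void func_used() {
    int i = 42;
    sink = i;                 // used: i survives as a register or an immediate
}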

Upvotes: 2
