extremeaxe5

Reputation: 815

What does a program (the assembly) that uses a GPU even look like?

From this answer, it seems that GPU manufacturers just provide a driver for particular GPU APIs, and that there's no such thing as GPU assembly; or, at the very least, that there will never be a GPU assembly programming manual published like the AMD64 programmer's manual.

However, as I understand it, all processes run through the CPU, and can be disassembled.

My question is: What would the assembly of a program using the GPU look like? My hypothesis is that it would use system calls to manipulate a device file representing the GPU. Is this hypothesis correct?

Upvotes: 1

Views: 5156

Answers (4)

ProjectPhysX

Reputation: 5746

If you are using an Nvidia GPU, you can view the PTX assembly code. PTX is only pseudo-assembly, sitting somewhere between OpenCL and the binary code that actually runs on the GPU. This is how you get to it from OpenCL:

Context context(device);
CommandQueue queue(context, device); // queue to push commands for the device
Program::Sources source;
string kernel_code = opencl_code_settings(N,M)+opencl_code(); // helper functions that return the OpenCL kernel source as a string
source.push_back({ kernel_code.c_str(), kernel_code.length() });
Program program(context, source);
if(program.build("-cl-fast-relaxed-math")) return false; // compile the OpenCL code, return false if there is an error
const string ptx_code = program.getInfo<CL_PROGRAM_BINARIES>()[0]; // on Nvidia, the program "binary" is the PTX assembly text

The string ptx_code is what you are looking for. You can, for example, dump it to a file for inspection (a one-line sketch; the file name is arbitrary, and <fstream> is assumed to be included):
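std::ofstream("kernel.ptx") << ptx_code; // write the PTX text to a file

Here is a small example kernel: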

kernel void benchmark_1(global float* data) { // def_N and def_M are compile-time constants baked into the source by opencl_code_settings(N,M)
    const uint n = get_global_id(0);
    #pragma unroll
    for(uint i=0; i<def_M; i++) data[i*def_N+n] = 0.0f; // each thread zeroes def_M floats, strided by def_N
}

This is what the PTX code for this kernel looks like:

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: UNKNOWN
// Driver 
// Based on LLVM 3.4svn
//

.version 6.2
.target sm_61, texmode_independent
.address_size 64

    // .globl   benchmark_1

.entry benchmark_1(
    .param .u64 .ptr .global .align 4 benchmark_1_param_0
)
{
    .reg .b32   %r<23>;
    .reg .b64   %rd<34>;


    ld.param.u64    %rd1, [benchmark_1_param_0];
    mov.b32 %r1, %envreg3;
    mov.u32     %r2, %ntid.x;
    mov.u32     %r3, %ctaid.x;
    mad.lo.s32  %r4, %r3, %r2, %r1;
    mov.u32     %r5, %tid.x;
    add.s32     %r6, %r4, %r5;
    mul.wide.u32    %rd2, %r6, 4;
    add.s64     %rd3, %rd1, %rd2;
    mov.u32     %r7, 0;
    st.global.u32   [%rd3], %r7;
    add.s32     %r8, %r6, 15728640;
    mul.wide.u32    %rd4, %r8, 4;
    add.s64     %rd5, %rd1, %rd4;
    st.global.u32   [%rd5], %r7;
    add.s32     %r9, %r6, 31457280;
    mul.wide.u32    %rd6, %r9, 4;
    add.s64     %rd7, %rd1, %rd6;
    st.global.u32   [%rd7], %r7;
    add.s32     %r10, %r6, 47185920;
    mul.wide.u32    %rd8, %r10, 4;
    add.s64     %rd9, %rd1, %rd8;
    st.global.u32   [%rd9], %r7;
    add.s32     %r11, %r6, 62914560;
    mul.wide.u32    %rd10, %r11, 4;
    add.s64     %rd11, %rd1, %rd10;
    st.global.u32   [%rd11], %r7;
    add.s32     %r12, %r6, 78643200;
    mul.wide.u32    %rd12, %r12, 4;
    add.s64     %rd13, %rd1, %rd12;
    st.global.u32   [%rd13], %r7;
    add.s32     %r13, %r6, 94371840;
    mul.wide.u32    %rd14, %r13, 4;
    add.s64     %rd15, %rd1, %rd14;
    st.global.u32   [%rd15], %r7;
    add.s32     %r14, %r6, 110100480;
    mul.wide.u32    %rd16, %r14, 4;
    add.s64     %rd17, %rd1, %rd16;
    st.global.u32   [%rd17], %r7;
    add.s32     %r15, %r6, 125829120;
    mul.wide.u32    %rd18, %r15, 4;
    add.s64     %rd19, %rd1, %rd18;
    st.global.u32   [%rd19], %r7;
    add.s32     %r16, %r6, 141557760;
    mul.wide.u32    %rd20, %r16, 4;
    add.s64     %rd21, %rd1, %rd20;
    st.global.u32   [%rd21], %r7;
    add.s32     %r17, %r6, 157286400;
    mul.wide.u32    %rd22, %r17, 4;
    add.s64     %rd23, %rd1, %rd22;
    st.global.u32   [%rd23], %r7;
    add.s32     %r18, %r6, 173015040;
    mul.wide.u32    %rd24, %r18, 4;
    add.s64     %rd25, %rd1, %rd24;
    st.global.u32   [%rd25], %r7;
    add.s32     %r19, %r6, 188743680;
    mul.wide.u32    %rd26, %r19, 4;
    add.s64     %rd27, %rd1, %rd26;
    st.global.u32   [%rd27], %r7;
    add.s32     %r20, %r6, 204472320;
    mul.wide.u32    %rd28, %r20, 4;
    add.s64     %rd29, %rd1, %rd28;
    st.global.u32   [%rd29], %r7;
    add.s32     %r21, %r6, 220200960;
    mul.wide.u32    %rd30, %r21, 4;
    add.s64     %rd31, %rd1, %rd30;
    st.global.u32   [%rd31], %r7;
    add.s32     %r22, %r6, 235929600;
    mul.wide.u32    %rd32, %r22, 4;
    add.s64     %rd33, %rd1, %rd32;
    st.global.u32   [%rd33], %r7;
    ret;
}

From the PTX code you can, for example, count the FLOPs and memory transfers to examine how efficiently the code runs, via the roofline model.
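As a back-of-the-envelope example with the PTX above: each thread executes 16 st.global.u32 instructions (16 × 4 B = 64 B written per thread) and no floating-point operations at all (the add and mul instructions are integer address arithmetic), so the arithmetic intensity is about 0 FLOPs/byte and the roofline model says this kernel is purely memory-bandwidth-bound.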

Upvotes: 5

GmanGavin

Reputation: 201

To put it simply:

  • Nvidia GPU -> PTX (Parallel Thread Execution)
  • AMD GPU -> GCN (Graphics Core Next) ISA, programmed through the ROCm (Radeon Open Compute) stack

To add onto the "microarchitecture" information:

  • CUDA uses NVCC as its compiler

With NVCC you can use the --ptx flag to generate PTX from your CUDA source files.

With NVCC you can also specify a target architecture using -arch=<architecture>, for example -arch=sm_61. A minimal sketch of both flags follows below.
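The kernel and file names here are made up for illustration (not from Nvidia's docs):

// kernel.cu -- a trivial CUDA kernel, for illustration only
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n) data[i] *= factor;
}

Compiling it with

nvcc --ptx -arch=sm_61 kernel.cu -o kernel.ptx

writes the generated PTX to kernel.ptx, the same kind of pseudo-assembly shown in the answer above.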


More information from Nvidia's site.

Upvotes: 0

matt0089

Reputation: 471

I was wondering this too. I found this answer with good details:

...interaction with kernel for rendering as well as making use of GPU for preparing a scene before rendering a frame happen via ioctls to a DRM device


I'm not sure, but if you're interested in lower-level details like I was:

From what I've read, the protocols for communicating with GPUs have changed a lot over time, but at a high level I think it's something like the steps below (I'm sorry it's hand-wavy and not that precise):

  1. The CPU takes shader code and compiles it into instructions that a GPU core can understand.
  2. The CPU loads these instructions into a memory-mapped address space that is connected to the GPU (I'm guessing this is triggered via ioctl system calls in user space; see the sketch after this list).
  3. The CPU may then write to a specific memory-mapped address to tell the graphics card it's done.
  4. The graphics card loads this shader program into its own memory for its cores to execute.
  5. And so on...
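To make the ioctl part slightly more concrete, here is a minimal sketch that only asks a DRM device which driver sits behind it (it assumes Linux with the kernel DRM headers installed and a GPU at /dev/dri/card0, and it doesn't submit any actual GPU work):

// drm_version.c -- query the DRM driver name via ioctl (minimal sketch)
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <drm/drm.h> // kernel uapi header; its location may vary by distro

int main(void) {
    int fd = open("/dev/dri/card0", O_RDWR); // the device file representing the first GPU
    if (fd < 0) { perror("open /dev/dri/card0"); return 1; }
    char name[64] = {0};
    struct drm_version v;
    memset(&v, 0, sizeof v);
    v.name = name;                 // the kernel fills in the driver name
    v.name_len = sizeof(name) - 1;
    if (ioctl(fd, DRM_IOCTL_VERSION, &v) == 0)
        printf("DRM driver: %s %d.%d\n", name, v.version_major, v.version_minor);
    else
        perror("DRM_IOCTL_VERSION");
    close(fd);
    return 0;
}

As I understand it, this prints something like "amdgpu" or "i915" depending on the GPU; real work submission uses driver-specific ioctls on top of the same device file.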

Ben Eater's videos on YouTube were awesome, super helpful for understanding what happens at the hardware level when you, say, want to display a pixel on your screen. He has a series where he interfaces a microcontroller directly to an LCD screen via VGA and writes assembly code to display an image on the screen.

Upvotes: 3

What does code that uses a GPU even look like?

Read much more about OpenCL (or, for Nvidia hardware only, about CUDA). Also be aware of OpenACC! See also OpenCL-related resources, and read some OpenCL book and some OpenCL tutorial.

In practice, you'll never see the "assembler code" of your GPGPU. But you'll code using OpenCL (which is very low-level; tuning your code to your particular hardware is difficult and error-prone).

AFAIK, AMD tends to publish the "machine code specification" (i.e. the ISA) of most of its GPUs; Nvidia is much more secretive. Notice that SPIR is "assembly-like" (it is actually based on LLVM bitcode), but still not exactly an assembler.

My question is: What would the assembly of a program using the GPU look like? My hypothesis is that it would use system calls to manipulate a device file representing the GPU. Is this hypothesis correct?

The system calls (which are very hardware-specific) transmit the SPIR or equivalent bytecode (and often GPGPU-specific machine code) from the CPU (and virtual memory) to the GPU, and also move data from the GPGPU back to the CPU (and memory) and vice versa. The details are boringly complex and generally proprietary to the hardware manufacturers. You'd rather use the OpenCL (or CUDA) API and dialect. So your hypothesis is wrong, or at least oversimplified to the point of being meaningless.
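To make that concrete: here is a minimal sketch of the host-side flow using the standard OpenCL C API (error checks omitted; the kernel string and buffer size are made up for illustration). Every call goes through the vendor's OpenCL driver, which in turn talks to the kernel driver through those hardware-specific system calls:

// opencl_flow.c -- typical host-side OpenCL flow (minimal sketch, no error checking)
#include <CL/cl.h>

const char* src =
    "kernel void fill(global float* d) { d[get_global_id(0)] = 0.0f; }";

int main(void) {
    cl_platform_id platform; clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;     clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, NULL, NULL); // OpenCL 2.0+
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL); // the driver compiles the source to GPU machine code here
    cl_kernel k = clCreateKernel(prog, "fill", NULL);
    size_t n = 1024;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL); // the driver submits the work to the GPU (on Linux, via ioctls)
    clFinish(q);
    return 0;
}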

Also look into the osdev.org wiki.

Actually, several open-source numerical libraries (such as TensorFlow, OpenCV, BLAS, ...) have OpenCL backends. So take several months to study their source code.

Understanding all the details will give you a PhD. Albert Cohen (and many other experts) might be your advisor.

Read also more about AMDGPU and their GCN. For example, look into the AMD Vega ISA specification.

However, as I understand it, all processes run through the CPU, and can be disassembled.

This is a very naive claim, and I believe it is wrong (at least for the programs I like writing, which all somehow generate code at runtime). And in practice, you won't understand the disassembled code (that is why decompilation is so difficult). For examples of programs generating machine code, look (on Linux) into SBCL (its REPL emits machine code at every user interaction), into any meta-program, or into most programs using JIT-compilation techniques (in practice, most Java JVMs do JIT translation). My manydl.c Linux program generates C code at runtime, compiles it into a shared library (i.e. a plugin that can be dynamically linked), then dlopen(3)-s that plugin, and can repeat all of that many hundreds of thousands of times. For an example of a library helpful for generating machine code, see libgccjit.
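A tiny sketch of that generate-compile-dlopen pattern (error checks omitted, file paths arbitrary; build it with cc gen_plugin.c -ldl):

// gen_plugin.c -- generate C code at runtime, compile it into a plugin, load and run it
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

int main(void) {
    FILE* f = fopen("/tmp/gen.c", "w"); // emit the source code
    fprintf(f, "int generated_add(int a, int b) { return a + b; }\n");
    fclose(f);
    system("cc -shared -fPIC -o /tmp/gen.so /tmp/gen.c"); // compile it as a shared library
    void* h = dlopen("/tmp/gen.so", RTLD_NOW); // load the plugin
    int (*add)(int, int) = (int (*)(int, int))dlsym(h, "generated_add");
    printf("%d\n", add(2, 3)); // prints 5: running code that did not exist at program start
    dlclose(h);
    return 0;
}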

You should also read more about OSes in general. I strongly recommend Operating Systems: Three Easy Pieces (freely downloadable).

Upvotes: 6
