Reputation: 815
From this answer, it seems that GPU manufacturers just provide a driver for particular GPU APIs, and that there is no such thing as GPU assembly, or at the very least, that there will never be a GPU assembly programming manual published like the AMD64 programmer's manual.
However, as I understand it, all processes run through the CPU and can be disassembled.
My question is: What would the assembly of a program using the GPU look like? My hypothesis is that it would use system calls to manipulate a device file representing the GPU. Is this hypothesis correct?
Upvotes: 1
Views: 5156
Reputation: 5746
If you are using an Nvidia GPU, you can view the PTX assembly code. PTX is only pseudo-assembly, sitting somewhere between OpenCL and the binary code that actually runs on the GPU. This is how you get to it from OpenCL:
Context context(device);
CommandQueue queue(context, device); // queue to push commands for the device
Program::Sources source;
string kernel_code = opencl_code_settings(N,M)+opencl_code(); // helper functions that assemble the OpenCL source string
source.push_back({ kernel_code.c_str(), kernel_code.length() });
Program program(context, source);
if(program.build("-cl-fast-relaxed-math")) return false; // compile OpenCL code, return false if there is an error
const string ptx_code = program.getInfo<CL_PROGRAM_BINARIES>()[0]; // query the compiled binary, which on Nvidia is PTX text
The string ptx_code is what you are looking for.
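If you just want to read it, you can dump the string to a file; a minimal continuation of the snippet above (the file name is arbitrary):

#include <fstream> // add this include at the top of the file
std::ofstream("kernel.ptx") << ptx_code; // write the PTX text out for offline inspection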
Here is a small example kernel:
kernel void benchmark_1(global float* data) {
const uint n = get_global_id(0);
#pragma unroll
for(uint i=0; i<def_M; i++) data[i*def_N+n] = 0.0f;
}
This is what the PTX code for this kernel looks like:
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: UNKNOWN
// Driver
// Based on LLVM 3.4svn
//
.version 6.2
.target sm_61, texmode_independent
.address_size 64
// .globl benchmark_1
.entry benchmark_1(
.param .u64 .ptr .global .align 4 benchmark_1_param_0
)
{
.reg .b32 %r<23>;
.reg .b64 %rd<34>;
ld.param.u64 %rd1, [benchmark_1_param_0];
mov.b32 %r1, %envreg3;
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %ctaid.x;
mad.lo.s32 %r4, %r3, %r2, %r1;
mov.u32 %r5, %tid.x;
add.s32 %r6, %r4, %r5;
mul.wide.u32 %rd2, %r6, 4;
add.s64 %rd3, %rd1, %rd2;
mov.u32 %r7, 0;
st.global.u32 [%rd3], %r7;
add.s32 %r8, %r6, 15728640;
mul.wide.u32 %rd4, %r8, 4;
add.s64 %rd5, %rd1, %rd4;
st.global.u32 [%rd5], %r7;
add.s32 %r9, %r6, 31457280;
mul.wide.u32 %rd6, %r9, 4;
add.s64 %rd7, %rd1, %rd6;
st.global.u32 [%rd7], %r7;
add.s32 %r10, %r6, 47185920;
mul.wide.u32 %rd8, %r10, 4;
add.s64 %rd9, %rd1, %rd8;
st.global.u32 [%rd9], %r7;
add.s32 %r11, %r6, 62914560;
mul.wide.u32 %rd10, %r11, 4;
add.s64 %rd11, %rd1, %rd10;
st.global.u32 [%rd11], %r7;
add.s32 %r12, %r6, 78643200;
mul.wide.u32 %rd12, %r12, 4;
add.s64 %rd13, %rd1, %rd12;
st.global.u32 [%rd13], %r7;
add.s32 %r13, %r6, 94371840;
mul.wide.u32 %rd14, %r13, 4;
add.s64 %rd15, %rd1, %rd14;
st.global.u32 [%rd15], %r7;
add.s32 %r14, %r6, 110100480;
mul.wide.u32 %rd16, %r14, 4;
add.s64 %rd17, %rd1, %rd16;
st.global.u32 [%rd17], %r7;
add.s32 %r15, %r6, 125829120;
mul.wide.u32 %rd18, %r15, 4;
add.s64 %rd19, %rd1, %rd18;
st.global.u32 [%rd19], %r7;
add.s32 %r16, %r6, 141557760;
mul.wide.u32 %rd20, %r16, 4;
add.s64 %rd21, %rd1, %rd20;
st.global.u32 [%rd21], %r7;
add.s32 %r17, %r6, 157286400;
mul.wide.u32 %rd22, %r17, 4;
add.s64 %rd23, %rd1, %rd22;
st.global.u32 [%rd23], %r7;
add.s32 %r18, %r6, 173015040;
mul.wide.u32 %rd24, %r18, 4;
add.s64 %rd25, %rd1, %rd24;
st.global.u32 [%rd25], %r7;
add.s32 %r19, %r6, 188743680;
mul.wide.u32 %rd26, %r19, 4;
add.s64 %rd27, %rd1, %rd26;
st.global.u32 [%rd27], %r7;
add.s32 %r20, %r6, 204472320;
mul.wide.u32 %rd28, %r20, 4;
add.s64 %rd29, %rd1, %rd28;
st.global.u32 [%rd29], %r7;
add.s32 %r21, %r6, 220200960;
mul.wide.u32 %rd30, %r21, 4;
add.s64 %rd31, %rd1, %rd30;
st.global.u32 [%rd31], %r7;
add.s32 %r22, %r6, 235929600;
mul.wide.u32 %rd32, %r22, 4;
add.s64 %rd33, %rd1, %rd32;
st.global.u32 [%rd33], %r7;
ret;
}
From the PTX code you can, for example, count the FLOPs and memory transfers to estimate how efficiently the code runs via the roofline model.
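As a hedged sketch of that bookkeeping for the kernel above (the 16 stores and the 15728640 stride are read straight off the PTX; the assumption that the global work size equals def_N and the 480 GB/s peak bandwidth of an sm_61-class card are mine):

#include <cstdio>

int main() {
    const double threads = 15728640.0; // def_N: the stride between the unrolled stores in the PTX
    const double stores  = 16.0;       // number of st.global.u32 instructions per thread
    const double bytes   = threads * stores * 4.0; // 4 bytes per float, ~1.0 GB of memory traffic
    const double flops   = 0.0;        // the kernel only writes zeros, so no floating-point math
    const double bw      = 480e9;      // assumed peak memory bandwidth in bytes/s
    printf("arithmetic intensity: %g FLOPs/byte (purely memory-bound)\n", flops / bytes);
    printf("roofline lower bound on runtime: %g ms\n", bytes / bw * 1e3);
    return 0;
}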
Upvotes: 5
Reputation: 201
To put it simply, and to add onto the "microarchitecture" information:
With NVCC you can use the --ptx flag to generate PTX from your CUDA source files.
With NVCC you can also specify a target architecture using -arch=typehere.
More information from Nvidia's site.
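For example, given this trivial kernel (the file and kernel names are invented, and sm_61 is just one possible target):

// add.cu -- a minimal kernel, just to have something to compile
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

running nvcc --ptx -arch=sm_61 add.cu -o add.ptx writes the PTX for that architecture into add.ptx.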
Upvotes: 0
Reputation: 471
I was wondering this too. I found this answer with good details.
Not sure, but if you're interested in lower-level details like I was:
From what I read, it seems like the protocols for communicating with GPUs have probably changed a lot, but at a high level I think it's something like this (I'm sorry it's hand-wave-y and not that precise): your program talks to a device file representing the GPU, handing commands and data to the kernel-space driver (e.g. via ioctl system calls in user-space).
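As a concrete (and heavily simplified) illustration of that path, here is a sketch that opens a Linux DRM device file and issues one real ioctl, asking the kernel which GPU driver sits behind it; the device path is an assumption, and actual command submission is far more involved:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <drm/drm.h> // kernel/libdrm uapi header: DRM_IOCTL_VERSION, struct drm_version

int main(void) {
    int fd = open("/dev/dri/card0", O_RDWR); // a device file representing the GPU
    if (fd < 0) { perror("open"); return 1; }

    char name[64] = {0};
    struct drm_version v;
    memset(&v, 0, sizeof v);
    v.name = name;                // buffer the kernel driver fills in
    v.name_len = sizeof name - 1;

    // The user-space -> kernel-driver hop: an ioctl on the device file.
    if (ioctl(fd, DRM_IOCTL_VERSION, &v) == 0)
        printf("kernel GPU driver: %s\n", name);

    close(fd);
    return 0;
}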
Ben Eater's videos on YouTube were awesome, super helpful for understanding what happens at the hardware level when you, say, wanna display a pixel on your screen. He has a series where he interfaces a microcontroller directly to an LCD screen via VGA and writes assembly code to display an image on the screen.
Upvotes: 3
Reputation: 1
What does code that uses a GPU even look like?
Read much more about OpenCL (or, for Nvidia hardware only, about CUDA). Also be aware of OpenACC! See also OpenCL-related resources, read some OpenCL book, and work through some OpenCL tutorial.
In practice, you'll never see the "assembler code" of your GPGPU. But you'll code using OpenCL (it is very low-level, and tuning your code to your particular hardware is difficult and error-prone).
AFAIK, AMD tends to publish the "machine code specification" (i.e. the ISA) of most of its GPUs. Nvidia is much more secretive. Notice that SPIR is "assembly-like" (actually based on LLVM bitcode), but still not exactly an assembler.
My question is: What would the assembly of a program using the GPU look like? My hypothesis is that it would use system calls to manipulate a device file representing the GPU. Is this hypothesis correct?
The (very hardware-specific) system calls transmit the SPIR or equivalent bytecode (and often GPGPU-specific machine code) from the CPU (and virtual memory) to the GPU, and also move data from the GPU back to the CPU (and memory) and vice versa. The details are boringly complex and generally proprietary to the hardware manufacturers. You'd prefer to use the OpenCL (or CUDA) API and dialect. Your hypothesis is wrong, or at least oversimplified to the point of being meaningless.
Look also into osdev.org wiki.
Actually, several open-source numerical libraries (such as TensorFlow, OpenCV, BLAS, ...) have OpenCL backends. So take several months to study their source code.
Understanding all the details will give you a PhD. Albert Cohen (and many other experts) might be your advisor.
Also read more about AMDGPU and its GCN architecture. For example, look into the AMD Vega specification.
However, as I understand it, all processes run through the CPU and can be disassembled.
This is a very naive claim, and I believe it is wrong (at least for the programs I like writing; they all somehow generate code at runtime). And in practice, you won't understand the disassembled code (that is why decompilation is so difficult). For examples of programs generating machine code, look (on Linux) into SBCL (its REPL emits machine code at every user interaction), into any meta-program, or into most programs using JIT-compilation techniques (in practice, most Java JVMs do JIT translation). My manydl.c program for Linux generates C code at runtime, compiles it into a shared library, i.e. a plugin that can be dynamically linked, then dlopen(3)-s that plugin (and it can repeat all that many hundreds of thousands of times). For an example of a library helpful for generating machine code, see libgccjit.
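A hand-wavy sketch of that generate-compile-dlopen cycle (the file paths, the cc command line, and the answer function are invented for illustration; manydl.c itself is far more elaborate):

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // 1. Generate some C code at runtime.
    FILE* f = fopen("/tmp/gen.c", "w");
    if (!f) return 1;
    fprintf(f, "int answer(void) { return 42; }\n");
    fclose(f);

    // 2. Compile it into a shared library, i.e. a plugin.
    if (system("cc -shared -fPIC /tmp/gen.c -o /tmp/gen.so") != 0) return 1;

    // 3. dlopen(3) the plugin and jump into the freshly built machine code.
    void* h = dlopen("/tmp/gen.so", RTLD_NOW);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    int (*answer)(void) = (int (*)(void))dlsym(h, "answer");
    printf("%d\n", answer()); // prints 42
    dlclose(h);
    return 0;
}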
You should also read more about OSes in general. I strongly recommend Operating Systems: Three Easy Pieces (freely downloadable).
Upvotes: 6