einpoklum

Reputation: 131547

Why do I see MOV Rn, Rn instructions in debugging-mode nVIDIA SASS code?

Here's a snippet of some SASS code for a kernel I'm working on (for an sm52 target, compiled in debugging mode):

/*0028*/                   ISETP.GE.U32.AND P0, PT, R1, R0, PT;    /* 0x5b6c038000070107 */
/*0030*/               @P0 BRA 0x40;                               /* 0xe24000000080000f */
/*0038*/                   BPT.TRAP 0x1;                           /* 0xe3a00000001000c0 */
                                                                   /* 0x007fbc0321e01fef */
/*0048*/                   IADD R2, R1, RZ;                        /* 0x5c1000000ff70102 */
/*0050*/                   I2I.U32.U32 R2, R2;                     /* 0x5ce0000000270a02 */
/*0058*/                   MOV R2, R2;                             /* 0x5c98078000270002 */
                                                                   /* 0x007fbc03fde01fef */
/*0068*/                   MOV R3, RZ;                             /* 0x5c9807800ff70003 */
/*0070*/                   MOV R2, R2;                             /* 0x5c98078000270002 */
/*0078*/                   MOV R3, R3;                             /* 0x5c98078000370003 */
                                                                   /* 0x007fbc03fde01fef */
/*0088*/                   MOV R4, R2;                             /* 0x5c98078000270004 */
/*0090*/                   MOV R5, R3;                             /* 0x5c98078000370005 */
/*0098*/                   MOV R2, c[0x0][0x4];                    /* 0x4c98078000170002 */
                                                                   /* 0x007fbc03fde01fef */
/*00a8*/                   MOV R3, RZ;                             /* 0x5c9807800ff70003 */
/*00b0*/                   LOP.OR R2, R4, R2;                      /* 0x5c47020000270402 */
/*00b8*/                   LOP.OR R3, R5, R3;                      /* 0x5c47020000370503 */

I'm noticing more than a couple of instructions of the form "Move the contents of register Rn to register Rn", and that doesn't seem to make sense. I know that when compiling with optimizations and without debugging info enabled, I don't get these instructions. But even in debugging mode, why are they there? What's their purpose? AFAIK, when compiling CPU code for debugging you don't get this kind of instruction.

Upvotes: 1

Views: 622

Answers (2)

Ross Ridge

Reputation: 39581

The simple answer is that you get strange code because you've turned on debugging, which turns off optimization. This is normal with modern optimizing compilers because of how they work. They break operations down into a primitive static single-assignment (SSA) form, which makes the code easier to optimize but, when not optimizing, produces worse code than a simpler non-optimizing compiler would.
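To make the "worse code" concrete, here is a minimal Python sketch of the kind of cleanup an optimization pass would normally perform: dropping moves whose source and destination are the same register. This is only an illustration over a few instructions from the question, not how NVIDIA's compiler is implemented; the helper name `drop_self_moves` and the regular expression are assumptions made up for this example.

```python
import re

# Matches "MOV Rn, Rn" where source and destination are the same register.
# Assumption: plain register operands only, as in the question's listing.
SELF_MOVE = re.compile(r"^MOV\s+(R\d+),\s*\1\s*;?$")

def drop_self_moves(instructions):
    """Return the instructions with redundant self-moves removed."""
    return [ins for ins in instructions if not SELF_MOVE.match(ins.strip())]

# A few instructions copied from the question's listing.
listing = [
    "IADD R2, R1, RZ;",
    "I2I.U32.U32 R2, R2;",
    "MOV R2, R2;",
    "MOV R3, RZ;",
    "MOV R2, R2;",
    "MOV R3, R3;",
    "MOV R4, R2;",
]

cleaned = drop_self_moves(listing)
for ins in cleaned:
    print(ins)
```

With debugging enabled, no pass of this kind runs, so the self-moves survive into the final code.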

There's also a possibility, though I don't think it's the case here, that the instructions are deliberately inserted NOPs meant to delay execution. GPUs have instruction sets that are very different from those of the general-purpose CPUs that you may be familiar with. For example, most CPUs work as if instructions are executed one at a time and strictly in the order they're given. This is true despite the fact that modern CPUs will try to execute instructions in parallel and even out of order for improved performance. GPUs typically don't work this way. If you try to use the result that a previous instruction stores in some register before that instruction has finished, you'll get the old value of the register. Unlike a CPU, a GPU won't automatically wait for the instruction to finish before executing the next instruction that depends on it.

If you look at the disassembled code you'll notice that instructions are grouped into bundles of three. You might also see that there are hidden instructions between the bundles. The machine code for such an instruction is shown on the right (e.g. /* 0x007fbc0321e01fef */), but it's not disassembled on the left and its address isn't shown, despite it taking up an 8-byte slot like any other instruction. This is actually a scheduling control code. It's not a real instruction; instead it tells the GPU how to schedule the three instructions in the bundle that follows it (note that the BRA 0x40 in your listing targets the address of one of these control codes, not the instruction at 0x48). It tells the GPU things like which instructions need to wait for previous instructions to complete and how long they should wait.
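The addresses in the question's listing make the hidden slots easy to spot. As a rough Python sketch (the helper name `hidden_slots` is made up for illustration, and the 8-byte slot size is taken from the description above), the gaps can be listed like this:

```python
def hidden_slots(shown_addresses, slot_size=8):
    """Return the slot addresses missing between the first and last shown address."""
    shown = set(shown_addresses)
    first, last = min(shown), max(shown)
    return [a for a in range(first, last + slot_size, slot_size) if a not in shown]

# Addresses printed on the left of the question's listing.
shown = [0x28, 0x30, 0x38, 0x48, 0x50, 0x58, 0x68, 0x70, 0x78,
         0x88, 0x90, 0x98, 0xA8, 0xB0, 0xB8]

gaps = hidden_slots(shown)
print([hex(a) for a in gaps])            # ['0x40', '0x60', '0x80', '0xa0']
print(all(a % 0x20 == 0 for a in gaps))  # True
```

For this listing, the missing slots fall at every 32-byte boundary, which matches one 8-byte control word per group of three 8-byte instructions.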

Finally, there's one more possibility, though an extremely unlikely one: the redundant MOVs might not actually be NOPs at all. They could be acting on yet-to-be-overwritten register values, in parallel with other instructions, in some weird manner that gives them a useful effect other than a delay. However, this would be a very advanced optimization technique that I would only expect in hand-tuned assembly code, not from a compiler that isn't even generating optimized code.

Upvotes: 2

Ped7g

Reputation: 16596

This is based on general compiler knowledge; I have no knowledge of CUDA.

Most programming languages have mostly context-free, stateless commands. Each such command can be compiled separately on its own into the target machine code (which makes this compilation step fairly simple to implement, since it deals only with the single command currently being parsed). Some exceptions are various prefix/suffix/with modifiers, or things like continue/break for controlling loops.

For example, variable = variable + 2; can be compiled into "add two to variable" independently of the previous and next commands in the source (simple and fast), which turns into: "load variable from memory into a register, add two to the register, store the value from the register back to the variable's memory".

Which register to use is difficult to decide. If you think about it for a while, a random register allocation is just as good as any other naive allocation rule. That is often how registers are allocated at the early stage of compilation (using whichever register has the smallest penalty for being clobbered).

But then you need some "bridge" code to connect the commands to each other: either by strictly using variables in memory (in which case there is no bridge code at all), or by reusing/sharing some values between commands and just moving them into the proper register (your "nonsense" mov rN,rN instructions, which save some fetches from memory).
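A toy Python sketch of this idea follows. It is purely illustrative and makes no claim about how NVIDIA's compiler works: the function `compile_chain`, the template format, and the opcodes `OP_A`, `OP_B`, `OP_C` are hypothetical placeholders. Each step is compiled in isolation with its own fixed registers, and a bridge move is emitted between steps unconditionally, so a self-move appears whenever the previous result already sits in the expected register.

```python
def compile_chain(steps):
    """Compile a chain of unary operations, each one in isolation.

    `steps` is a list of (opcode, in_reg, out_reg) templates: every step
    expects its input in `in_reg` and leaves its result in `out_reg`.
    """
    code = []
    current = None  # register holding the previous step's result
    for opcode, in_reg, out_reg in steps:
        if current is not None:
            # Bridge code: emitted without checking whether current == in_reg.
            code.append(f"MOV {in_reg}, {current};")
        code.append(f"{opcode} {out_reg}, {in_reg};")
        current = out_reg
    return code

# Hypothetical opcodes and register choices, for illustration only.
steps = [("OP_A", "R1", "R2"), ("OP_B", "R2", "R2"), ("OP_C", "R4", "R5")]

code = compile_chain(steps)
for line in code:
    print(line)
```

The second step happens to expect its input where the first step left it, so the bridge degenerates into `MOV R2, R2;`, while the third step gets a genuine `MOV R4, R2;`.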

The compilation stage(s) that optimize register allocation (trying to increase sharing/reuse of registers, reassigning registers for some commands and compiling them again, sometimes even reordering blocks of commands to make register sharing more optimal) are a non-trivial and time-consuming part of compilation that is not required for the code to work. A debug compilation skips this step to produce the binary faster.

Also, in a debug build it's desirable to store each variable's value into its memory after every source command, to make results visible in the debugger, whereas in an optimized release build the compiler may recognize the "intermediate" nature of some results and keep them temporarily in registers only.

Upvotes: 1
