Reputation: 385
I'm studying x86-64 NASM and here is current situation:
At first I wrote straightforward, easy-to-read code. Then I found some "clever" ways to initialize registers that reduce instruction length.
I want to know whether these clever tricks bring a real reward, or do more harm than good.
This is the first code with straight way:
.loop:
mov rax, -1
mov rdx, 1 ; **
mov rsi, 2 ; **
; ... loop body
dec rcx
jnz .loop
(**: The assembler actually emitted these lines as mov edx, 1 and mov esi, 2. Later I found that the assembler optimized them for me, because writing EDX/ESI zeroes out the upper 32 bits of RDX/RSI.)
That is 17 bytes of initialization and 5 bytes of loop ending.
This is the second code with clever way:
.loop:
xor eax, eax
dec rax
lea edx, [rax+2] ; ***
lea esi, [rdx+1] ; ***
; ... loop body
loop .loop
(***: I tried various combinations of 32-bit / 64-bit registers and these had the shortest instruction length.)
That is 11 bytes of initialization and 2 bytes of loop ending.
Upvotes: 6
Views: 183
Reputation: 365517
Using LEA to derive other constants from the first one is often a good tradeoff, saving a byte or two of code size vs. mov reg, imm32, especially in code outside a loop.
AMD Zen, and Ice Lake and later, have very good LEA throughput and latency for simple addressing modes with one register plus an imm8 or imm32, able to run on all four integer ALU units. (https://uops.info/ / https://agner.org/optimize/). On Skylake and earlier, LEA throughput with simple addressing modes was 2/clock.
MSVC does this optimization, as in Advantage of using LEA over MOV for passing parameters in Assembly compiled from C++ where my answer discusses the advantages. But note that MSVC's asm has three LEAs that can run in parallel, all dependent only on the initial MOV-immediate. Unlike yours where each instruction depends on the result of the previous one, creating a longer dependency chain than necessary and thus reducing instruction-level parallelism (ILP).
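A minimal sketch of that independent-LEA pattern (illustrative only, not MSVC's actual output; byte counts from the standard x86-64 encodings):

```nasm
mov  eax, 1            ; 5 bytes (no-ModRM opcode), zero-extends into RAX
lea  edx, [rax + 1]    ; 3 bytes each; all three LEAs depend only on RAX,
lea  esi, [rax + 2]    ; not on each other, so an out-of-order core can
lea  edi, [rax + 3]    ; execute them in the same cycle
```

Compare this with chaining each LEA off the previous result, which serializes the three initializations.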
A better way to materialize the same register values would be to start with the positive one so you can use a 5-byte mov with 32-bit operand-size, then use LEAs with 32- or 64-bit operand-size as necessary to create the others.
If any of your constants are zero, starting with it allows just an xor-zeroing instruction, which is about as cheap as a NOP on recent AMD and Intel: no back-end uop, and no latency from issue until dependent instructions can read the result. LEAs that depend on it can start executing as early as the same cycle they and the xor-zeroing were sent to the back-end (if this code happens to be running after an I-cache miss or other stall, so there aren't older uops waiting for execution ports).
When you're optimizing for speed, it's usually best to minimize uops (and instructions) when all else is equal. xor-zero / dec is 2 uops from two instructions and is 5 bytes long. (Since this is 64-bit code, the 1-byte encoding of dec r32 isn't available; x86-64 repurposes those 0x4? bytes from inc/dec as REX prefixes. So dec rax is REX + opcode + ModRM, 3 bytes total.) That would save 2 bytes vs. mov rax, -1 (7 bytes: REX + opcode + ModRM + imm32), but starting with a positive constant allows the special no-ModRM mov r32, imm32 opcode for a 5-byte instruction. We end up needing a REX prefix on one of the LEAs, but those already needed ModRM bytes. I think the net saving from starting with xor-zero / dec rax and then two 3-byte LEAs would only be one byte of code size vs. what I'm showing below, at the cost of an extra uop.
mov edx, 1 ; (5 bytes, zero-extending into RDX)
lea esi, [rdx + 1] ; (3 bytes, zero-extending into RSI)
lea rax, [rdx - 2] ; (4 bytes since we need a REX prefix to write non-zero bits to the high half of RAX)
If your loop could manage with just EAX=-1 (RAX=0x00000000FFFFFFFF), you could use 32-bit operand-size for that, too, for another 3-byte LEA. (The advantages of using 32bit registers/instructions in x86-64)
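A sketch of that all-32-bit variant, under the assumption that the loop never reads the high half of RAX:

```nasm
mov  edx, 1            ; 5 bytes
lea  esi, [rdx + 1]    ; 3 bytes, ESI = 2
lea  eax, [rdx - 2]    ; 3 bytes, EAX = -1 (RAX = 0x00000000FFFFFFFF)
```

That's 11 bytes total for all three constants, with only one level of dependency after the mov.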
You say your loop counter is known to be between 1 and 1000, so you could use dec ecx / jnz to save a byte of code-size there.
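Concretely, the loop ending would look like this (a sketch, assuming the counter fits in 32 bits; byte counts from the standard encodings):

```nasm
dec  ecx               ; 2 bytes (no REX needed), zeroes the upper half of RCX
jnz  .loop             ; 2 bytes for a short (rel8) branch
```

That's 4 bytes vs. 5 for dec rcx / jnz, with identical behavior for counters up to 2^32 - 1.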
(Saving code-size isn't always better; it aligns later code differently which affects how it will pack into the uop cache. But on average smaller is better and will almost always benefit the L1i cache. There are also effects on some uarches like Skylake's performance pothole due to a microcode workaround for an erratum: How can I mitigate the impact of the Intel jcc erratum on gcc?)
Upvotes: 3
Reputation: 93127
Whether it's a good idea to do this or not depends on your objective. Usually, it is not a good idea.
If your objective is ease of understanding, you should avoid these tricks as they make your code harder to understand.
If your objective is code-size reduction, it might indeed be a good idea to use such tricks. You can do even better than you already did, though; for example, you could do or rax, -1 to set rax to -1 in only 4 bytes, or push -1 followed by pop rax in only 3 bytes.
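For comparison, here is a sketch of the ways to set RAX = -1 mentioned so far, smallest first (sizes from the standard encodings):

```nasm
push -1                ; 2 bytes (6A FF, sign-extended imm8)
pop  rax               ; 1 byte  (58)           -> 3 bytes total, but
                       ;    touches the stack
or   rax, -1           ; 4 bytes (48 83 C8 FF)  -> has a false dependency
                       ;    on the old RAX value
mov  rax, -1           ; 7 bytes (48 C7 C0 imm32), no input dependency
```

Each step down the list trades a little speed or an extra constraint for code size, which is the theme of this answer.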
However, usually the objective is performance. Now when you optimise for performance, some tricks help, but others are detrimental. In particular, all the tricks you showed us in your question are detrimental to performance:
clearing and then decrementing a register takes just as long as, or is slightly slower than, setting the register to -1 directly, depending on the microarchitecture. I would avoid it anyway, as two instructions take up more decoder bandwidth than one.
deriving registers from other registers rather than setting them directly does not take more time per se, but as you introduce a dependency on the other register, these initialisations must now be performed after the other register is set rather than in parallel. This can reduce performance on out-of-order architectures and should be avoided, but sometimes it may still be beneficial. Design your code such that as many operations as possible can be done in parallel.
the loop instruction is well known to be slow and should be avoided. But so should dec followed by a conditional branch: as dec performs a partial flags update, a penalty exists on some microarchitectures if the flags are read subsequently. Use sub rcx, 1 instead if you want to evaluate the flags result.
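The loop-ending variants discussed above, side by side (a sketch; byte counts from the standard encodings, performance caveats as described in this answer):

```nasm
loop .loop             ; 2 bytes, but a slow instruction on most CPUs

dec  rcx               ; 3 bytes; partial flags update can cost a penalty
jnz  .loop             ; 2 bytes   on some microarchitectures

sub  rcx, 1            ; 4 bytes; full flags update, no partial-flag issue
jnz  .loop             ; 2 bytes
```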
Note that when optimising for performance, occasionally it might still be a good idea to optimise for size. This is because longer code sequences take more space in the instruction cache, blocking other code from being cached. In big programs whose hot code paths do not entirely fit into L1 instruction cache, performance can benefit from code size optimisations, especially in cold paths that are rarely executed. However, this is a tricky thing to evaluate and strategies must be adapted to the case at hand. Let benchmarks guide your decisions in any case.
Upvotes: 6