Reputation: 3274
I am experimenting with how parameters are passed to a function when compiling C++ code. I compiled the following C++ code with the x64 MSVC 19.35 (latest) compiler to see the resulting assembly:
#include <cstdint>
void f(std::uint32_t, std::uint32_t, std::uint32_t, std::uint32_t);
void test()
{
    f(1, 2, 3, 4);
}
and got this result:
void test(void) PROC
mov edx, 2
lea r9d, QWORD PTR [rdx+2]
lea r8d, QWORD PTR [rdx+1]
lea ecx, QWORD PTR [rdx-1]
jmp void f(unsigned int,unsigned int,unsigned int,unsigned int)
void test(void) ENDP
What I do not understand is why the compiler chose to use lea instead of a simple mov for this example. I understand the mechanics of lea and how it results in the correct values in each register, but I would have expected something more straightforward, like:
void test(void) PROC
mov ecx, 1
mov edx, 2
mov r8d, 3
mov r9d, 4
jmp void f(unsigned int,unsigned int,unsigned int,unsigned int)
void test(void) ENDP
Moreover, from my limited understanding of how modern CPUs work, I have the feeling that the version using lea would be slower, since it adds a dependency between the lea instructions and the mov instruction.
clang and gcc both give the result I expect, i.e., 4x mov.
Upvotes: 4
Views: 281
Reputation: 58673
MSVC's code is smaller than the naive mov approach. (But as you point out, because of the dependency, it may potentially be slower; you would have to test that.)
1 bits 64
2 00000000 BA02000000 mov edx, 2
3 00000005 448D4A02 lea r9d, QWORD [rdx+2]
4 00000009 448D4201 lea r8d, QWORD [rdx+1]
5 0000000D 8D4AFF lea ecx, QWORD [rdx-1]
6
7 00000010 B901000000 mov ecx, 1
8 00000015 BA02000000 mov edx, 2
9 0000001A 41B803000000 mov r8d, 3
10 00000020 41B904000000 mov r9d, 4
mov ecx, 1 is 5 bytes: one byte for the opcode (B8-BF, which also encodes the register) and 4 bytes for the 32-bit immediate. In particular, unlike some arithmetic instructions, mov has no option to encode a smaller immediate in fewer bytes using zero- or sign-extension.
lea ecx, [rdx-1] is 3 bytes: one byte for the opcode; one ModRM byte, which encodes the destination register ecx and the base register rdx for the effective address of the memory operand; and (here is the key) one byte for an 8-bit sign-extended displacement.
The instructions using r8 and r9 need one extra byte for a REX prefix, but that's true for both mov and lea, so it's a wash.
Upvotes: 8
Reputation: 365557
lea r32, [reg+disp8] is 3 bytes, vs. mov r32, imm32 being 5 bytes.
See Tips for golfing in x86/x64 machine code and Nate's answer.
x86 is unfortunately missing a mov reg, sign_extended_imm8. All else equal (or nearly equal), smaller code size is usually better, especially in "cold" code that might have to come from legacy decode. (And also for I-cache / iTLB footprint reasons.)
Cool, I didn't realize any compilers were using this code-size optimization for materializing constants in registers. Nice job, MSVC. GCC and Clang should be doing this, too, at least with -Os. Probably even for -O2/-O3; there will be some cases where it's not a win, but I expect it's good on average on most CPUs.
GCC/clang -Oz use push imm8 / pop reg for code-size optimization even at significant cost to performance; Godbolt. That's also 3 bytes, but much less efficient.
Intel since Ice Lake has 4/clock lea (with simple addressing modes), and Zen has always had that. Previously 2/clock lea throughput on Skylake and earlier, but still only 1 cycle latency. (https://uops.info/)
I have the feeling that the version using lea would be slower since it adds a dependency between the lea instructions and the mov instruction.
All 3 read the mov-immediate result from RDX, so there's good instruction-level parallelism, not a chain of dependencies. And RDX started a new dependency chain, so it can execute as early as the cycle after the front-end issues it.
By the time instructions after the jmp that read the results are in the pipeline, the leas can already have executed if there are any spare cycles on the execution units they're scheduled to. (Or if there's lots of independent work in the pipeline and we're just bottlenecked on back-end ALU throughput, then the instructions in the tailcalled function wouldn't get a cycle on an execution unit either. Unless maybe it was a load instead of ALU, or an execution port that wasn't busy... But then mov-imm would have had the same problem, just waiting for ALU execution port throughput, not latency.)
(uops are scheduled oldest-ready first, so under normal conditions where the front-end is fairly far ahead of the oldest instructions being executed, independent work like this can usually find a gap.)
If any of the instructions using these constants use it with data coming from older instructions, it's very likely that latency of materializing the constants will be a non-issue. I think it's very unlikely that the extra latency before R8/R9/RCX are ready would end up costing cycles in a modern out-of-order exec x86.
It's a little odd that it put the lea for ECX last, though; many functions look at their first arg first, so you'd want that to be the mov-immediate or the first lea. All three leas can execute in parallel, but the last ones might get issued by the front-end a cycle later. And with oldest-ready-first scheduling, if any get scheduled to the same port (because the number of uops waiting for all other ports is high) then they'll have a resource conflict and have to take turns.
I wonder if the compiler's algorithm was to pick a middle value to make it more likely that all the values were in range of [reg+disp8] compact addressing modes. (Hopefully it also prefers to pick a "legacy" register so REX prefixes can be minimized; if it had picked R8, all three LEAs would have needed a REX.)
If execution-port pressure is fairly even, they might not all get scheduled to different ports when issuing in the same cycle. See x86_64 haswell instruction scheduled on already used port instead of unused one for details on how Haswell schedules multiple uops in the same cycle. So this could create a resource conflict, making one of the lea results not ready until 2 cycles after the mov result was ready. (2 cycles where that port was free, if there are even older uops in the ROB that just had some gaps.)
So that's not very definitive, but my intuition is that this won't be a problem in practice. I'd guess (and hope) that MSVC developers profiled it on some existing codebases and didn't find any serious performance regressions, and hopefully found some minor overall speedups on average.
Upvotes: 6