Reputation: 131808
If I compile the following C++ program:
int baz(int x) { return x * x; }
in clang 15, I get:
baz(int):
mov eax, edi
imul eax, edi
ret
while gcc 12.2 gives me:
baz(int):
imul edi, edi
mov eax, edi
ret
(See this on GodBolt)
Are these two implementations entirely equivalent, and merely a matter of arbitrary choice? If they're not equivalent, how can their difference manifest, or affect my program? I mean, in terms of CPU-state side-effects, latencies of other instructions, behavior during inlining etc.
Upvotes: 3
Views: 184
Reputation: 364822
Do mov then imul, because it's better with mov-elimination and not worse anywhere for any other reason.
This is true in general for mov/and, mov/sub, etc., as long as you don't have a use for the original value. If you do, it is sometimes better to mov to make a copy and then modify the original, to hide mov latency on CPUs without move elimination. (mov/add or a small shift should normally be lea.)
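As a rough illustration of the copy-and-modify cases above, here are three hypothetical one-line functions (the names are mine, not from the question). The instruction sequences in the comments are what GCC and clang typically emit at -O2 for the x86-64 System V calling convention; exact output can differ between compiler versions and flags, so treat them as a sketch and check on your own toolchain.

```cpp
#include <cassert>

// Copy-and-add: compilers typically fold the copy into one instruction,
//   lea eax, [rdi+5]
// rather than emitting  mov eax, edi / add eax, 5.
constexpr int add_five(int x) { return x + 5; }

// Copy-and-shift by a small count (1, 2 or 3): typically an lea with a
// scaled index (scale factor 4 here) rather than  mov eax, edi / shl eax, 2.
constexpr int times_four(int x) { return x * 4; }

// Copy-and-AND has no lea form, so the expected pattern is mov first,
// then overwrite the mov result:
//   mov eax, edi
//   and eax, 15
constexpr int low_nibble(int x) { return x & 15; }

// Compile-time sanity checks on the C++ semantics (not on the code-gen).
static_assert(add_five(3) == 8, "x + 5");
static_assert(times_four(3) == 12, "x * 4");
static_assert(low_nibble(0x1234) == 4, "x & 15");
```

In the low_nibble case the original value in EDI is not needed afterwards, so the mov-then-modify ordering is the one the answer recommends.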
mov then imul is strictly better; overwriting a mov reg,reg result lets Intel CPUs free some resources they use to track mov elimination. (Probably something like a reference count for extra references beyond the normal RAT.) This increases the likelihood of later mov-eliminations being successful. See How do *move elimination* slots work in Intel CPU?
All else essentially equal (as in this case), prefer to mov then overwrite its result, especially when that doesn't make things worse for CPUs without mov-elimination (like Ice Lake, thanks Intel).
It doesn't have to be in the next instruction, just sometime soon, preferably not left indefinitely e.g. for a long-running loop. But even that isn't a disaster usually.
To measure this benefit, a microbenchmark would probably need to do a lot of mov instructions that don't overwrite their result, to run the CPU out of mov-elimination slots and have some of them need an execution unit. The microbenchmark would also need to be sensitive to the latency of those mov instructions, since most modern Intel CPUs have enough execution units to keep up with the issue/rename width in terms of throughput.
mov reg,reg has 1 cycle latency. If you'd been doing x*y with two separate inputs, mov then imul makes that latency part of the input->output latency for one input but not the other. The other has an extra cycle to become ready before the imul would have to wait for it, if out-of-order exec would tend to have one input ready before the other.
(A compiler would typically have no way to guess which input was the result of a long dep chain vs. a mov-immediate when compiling a non-inline function, but a 50/50 chance of winning a cycle is better than having the mov always on the critical path after the imul.)
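To make the two-input case concrete, here is a hypothetical two-argument variant of the question's function (mul is my name, not from the question). The comments sketch the two possible orderings under the x86-64 System V convention (x in EDI, y in ESI); the first is what GCC and clang typically emit, but verify on your own compiler.

```cpp
#include <cassert>

// Ordering A (mov first):
//   mov  eax, edi    ; copy x; 1 cycle of latency unless the mov is eliminated
//   imul eax, esi    ; y is first read here, so y may arrive one cycle later
//                    ; than x without delaying the result
//
// Ordering B (imul first):
//   imul edi, esi    ; needs both x and y ready
//   mov  eax, edi    ; the mov always sits on the critical path, after imul
//
// With ordering A the mov latency is only on the x -> result path;
// with ordering B it is on both the x -> result and y -> result paths.
constexpr int mul(int x, int y) { return x * y; }

static_assert(mul(6, 7) == 42, "x * y");
```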
But with x*x without mov-elimination, the only difference is that we're writing both EDI and EAX, instead of writing EAX twice. I don't think that's significant in terms of using up physical-register-file (PRF) entries or freeing them sooner. Since most code-gen is trying to be good across multiple CPUs, favour mov then imul because some CPUs do have mov-elimination. It's essentially a tie for CPUs without, when you're squaring one variable.
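For anyone who wants to confirm that the two orderings compute the same value, here is a minimal sketch using GNU-style inline asm in AT&T syntax. The function names and the HAVE_X86_64_GNU_ASM guard macro are hypothetical, and on other targets or compilers the code falls back to plain x * x. It only compares the returned value; it cannot show any of the microarchitectural differences discussed here.

```cpp
#include <cassert>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_64_GNU_ASM 1   // hypothetical guard macro for this sketch
#endif

// clang-style ordering: copy first, then multiply the copy.
inline int square_mov_first(int x) {
#ifdef HAVE_X86_64_GNU_ASM
    int result;
    asm("movl %1, %0\n\t"    // result = x
        "imull %1, %0"       // result *= x
        : "=&r"(result)      // early-clobber: result is written before x's last read
        : "r"(x)
        : "cc");             // imul modifies the flags
    return result;
#else
    return x * x;            // portable fallback when the asm path is unavailable
#endif
}

// gcc-style ordering: multiply in place, then copy the product.
inline int square_imul_first(int x) {
#ifdef HAVE_X86_64_GNU_ASM
    int result;
    asm("imull %1, %1\n\t"   // x *= x
        "movl %1, %0"        // result = x
        : "=&r"(result), "+r"(x)
        :
        : "cc");
    return result;
#else
    return x * x;
#endif
}

// Small self-check over a few inputs whose squares fit in an int.
inline void check_square_orderings() {
    const int inputs[] = {0, 1, -3, 7, 1000, 46340};
    for (int v : inputs) {
        assert(square_mov_first(v) == square_imul_first(v));
        assert(square_mov_first(v) == v * v);
    }
}
```

Calling check_square_orderings() from a test or from main exercises both versions; the compiler chooses the actual registers, so this is not byte-for-byte the code from the question.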
On a CPU that does partial register renaming, writing a register might free up two physical-register-file (PRF) entries instead of just one. (While allocating a new PRF entry either way.) But just reading the full register would already insert a merging uop.
Intel Sandybridge-family is the only x86-64 microarchitecture that does partial-register renaming and uses a PRF. Intel P6 family (Nehalem and earlier) keeps results right in the ROB, associated with the uop that produced them, until commit to a separate "retirement register file"; this is why it has register-read stalls when you read too many "cold" registers. Only Sandybridge itself (and possibly Ivy Bridge) renames low-8 registers like DIL and DL separately from full registers; on Haswell/Skylake and later only high-8 registers like DH get renamed separately.
Anyway, DIL might have been renamed separately from the full RDI. There is no DIH equivalent of DH or CH, since we're talking about EDI not EDX or ECX (the next two arg-passing registers), and gcc/clang very rarely generate code that writes high-8-bit registers. (Why doesn't GCC use partial registers?)
But either mov/imul or imul/mov will merge DIL into RDI before EDI is read, whether it's written or not (by the same imul uop). Same for DH on Haswell and later if we had an arg in EDX.
Upvotes: 5