Reputation: 533
I know that movzx
can be used for dependency breaking, but I stumbled on some movzx
uses by both Clang and GCC that I can't see any good reason for. Here's a simple example I tried on the Godbolt compiler explorer:
#include <stdint.h>
int add2bytes(uint8_t* a, uint8_t* b) {
return uint8_t(*a + *b);
}
with GCC 12 -O3
:
add2bytes(unsigned char*, unsigned char*):
movzx eax, BYTE PTR [rsi]
add al, BYTE PTR [rdi]
movzx eax, al
ret
If I understand correctly, the first movzx
here breaks the dependency on the previous eax
value, but what is the second movzx
doing? I don't think there's any dependency it can break, and it shouldn't affect the result either.
with clang 14 -O3
, it's even weirder:
add2bytes(unsigned char*, unsigned char*): # @add2bytes(unsigned char*, unsigned char*)
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al
ret
It uses mov
where movzx
seems more reasonable, and then zero-extends al
to eax
, but wouldn't it be much better to do movzx
at the start?
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1
GCC generates both sensible and strange movzx
, and Clang's use of mov r8 m
and movzx
just makes no sense to me. I also tried adding -march=skylake
to make sure this isn't a feature for really old architectures, but the generated assembly looks more or less the same.
The closest post I have found is https://stackoverflow.com/a/64915219/14730360 where they showed similar movzx
uses that seem useless and/or out of place.
Do the compilers really use movzx
poorly here, or am I missing something?
Edit: I have opened bug reports for Clang and GCC:
https://github.com/llvm/llvm-project/issues/56498
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277
Temporary workarounds using inline assembly: https://godbolt.org/z/7qob8G3j7
/* Byte-sized add via inline asm: "+r"(a) is a read-write register operand,
   "mi"(b) allows a memory or immediate source, and the %b0 modifier selects
   the low-byte name of operand 0 (e.g. al for eax). */
#define addb(a, b) asm (\
"addb %1, %b0"\
: "+r"(a) : "mi"(b))
int add2bytes(uint8_t* a, uint8_t* b) {
int ret = *a;
addb(ret, *b);
return ret;
}
Now Clang -O3 produces:
add2bytes(unsigned char*, unsigned char*): # @add2bytes(unsigned char*, unsigned char*)
movzx eax, byte ptr [rdi]
add al, byte ptr [rsi]
ret
Upvotes: 53
Views: 3235
Reputation: 364210
Both compilers are doing a poor job here, but clang's code is especially bad and has no real upside anywhere. And an easily avoidable downside on everything except Intel CPUs a decade old (which rename low-8 partial registers).
The optimal asm is what you suggest, movzx
load, then byte add, leaving a uint8_t
result in the low byte, correctly zero-extended to int
as required by the C semantics. (Thanks for reporting it upstream: https://github.com/llvm/llvm-project/issues/56498 - I commented there about movzx
being a good idea for byte loads in general, even when LLVM doesn't need the result zero-extended.)
A movzx
is necessary somewhere, but it can be in the initial load. (A movzx
is generally a good idea for a byte load anyway, to avoid a false dependency on the old RAX; clang's choice to save 1 byte is probably not a good one even when it doesn't end up needing a separate movzx
right after.)
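For this function, that ideal sequence is just two instructions plus the ret; a hand-written sketch (it matches what the question's inline-asm workaround gets clang to emit):
movzx eax, byte ptr [rdi]    # zero-extending byte load: a pure load uop, no false dep on the old RAX
add al, byte ptr [rsi]       # byte add; the upper bits of EAX are already zero
ret                          # so the int return value is correctly zero-extended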
There are basically three relevant behaviours here, among x86-64 CPUs (see the annotated sketch after these three cases).
Core 2 / Nehalem (the 64-bit capable members of the P6 family): AL renamed separately from RAX if you write AL. A later read of EAX will stall the front-end for about 3 cycles while inserting a merge uop. Less bad than earlier P6-family, but still a significant penalty to avoid. But these CPUs are pretty obsolete, and not something GCC's -mtune=generic
should put much weight on for the latest GCC. (Especially given that current nightly GCC's behaviour won't get baked into widely used binary packages by most stable-release distros for probably another year or more.)
Returning an int
when the last instruction wrote al
will likely lead to a penalty when the caller reads EAX. But mov al, [rdi]
can run without any false dependency or merging cost.
Sandybridge and maybe Ivy Bridge: AL still renamed separately, but a merging uop can be inserted without any stalling, in a cycle with other uops.
mov al, [rdi]
still has no false dep or merging uop. But a later read of EAX that triggers a merging uop (to merge the add al
result with the high bytes of RAX from movzx eax, [rdi]
) will get inserted just as cheaply as if we'd put a movzx eax, al
in the machine code. (If the upper bytes of RAX are all zero, merge or extend are equivalent.)
Haswell and later (and maybe IvB), and all other x86 vendors, and low-power CPUs from Intel like Silvermont-family: no partial register renaming at all. (Except for AH/BH/CH/DH on Intel SnB-family). The last CPU not in this category is nearly a decade old, and the last CPU with major penalties (P6-family) is over a decade old.
mov al, [rdi]
sucks: false dependency and costs an ALU uop in the back-end to merge. So it's extra load latency in the critical path through whatever stored the memory operand.
Reading EAX after writing AL has zero penalty; that's not a special case at all; the merging happened when you wrote AL.
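Putting those three cases together against clang's code for this function (my annotations, only restating the behaviour described above, not compiler output):
mov al, byte ptr [rsi]       # HSW+ / AMD / Silvermont: load + ALU uop merging into the old RAX, i.e. a false dependency
                             # Core 2 / Nehalem / SnB: AL renamed separately, so no false dep or merge cost here
add al, byte ptr [rdi]       # byte ALU either way
movzx eax, al                # HSW+: a real ALU uop, not mov-eliminated (same register)
                             # SnB: same cost as the merge uop a later EAX read would insert anyway
                             # Core 2 / Nehalem: avoids the ~3-cycle stall the caller would get reading EAX
ret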
GCC's code is a sensible tradeoff between Core2 / Nehalem vs. modern CPUs: load with movzx
to avoid a false dep writing a partial reg. And a final movzx
to avoid a partial-register stall in the caller.
But if it's going to do that, it could hurt modern Intel less by picking EDX or ECX as the temporary, since Intel can do zero-latency mov
-elimination on movzx r32, r8
, but not within the same register. It still costs a front-end uop so it's not free for throughput, only latency and back-end ports. This is a persistent missed-optimization; I don't think GCC or clang know to look for that; they commonly zero-extend 32->64 with mov esi,esi
on a function arg, for example.
movzx edx, byte ptr [rdi]
add dl, [rsi]
movzx eax, dl # mov-elimination possible on IvB and later (except Ice Lake with updated microcode which breaks mov-elim).
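As an aside, the same-register limitation applies to plain mov as well as movzx; a two-instruction sketch (register choices are arbitrary, not compiler output):
mov esi, esi                 # same-register 32->64 zero-extension: never mov-eliminated, costs a real ALU uop
mov eax, esi                 # different registers: eligible for zero-latency mov-elimination on IvB+ and AMD Zen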
If optimizing specifically for Core2 / Nehalem, you'd do this:
xor eax, eax # off the critical path, avoids partial-reg stalls for later reads of EAX after writing AL
mov al, [rdi]
add al, [rsi]
That's not bad on later CPUs, although the mov al, [rdi]
would still be a micro-fused load+ALU uop so it has extra load latency, and takes an extra slot in the scheduler and a cycle on a back-end execution port. So 3 back-end uops, up from 2 in IvB and later with eliminated movzx
if you pick different registers.
GCC's choice to use movzx
because of Core2/Nehalem is highly conservative at this point; probably -mtune=generic
in GCC12 shouldn't care about P6-family partial-register stalls since those CPUs are well over a decade old. Especially in 64-bit code where the worst case is Core2/Nehalem, not the even longer stalls with no merging uop on earlier P6-family. (And 64-bit code is more likely to be run on newer CPUs; one of the use-cases for -m32
is to make code for old 32-bit-only CPUs.)
It might well be an intentional tuning choice that needs updating. It's definitely a missed optimization with -march
/ -mtune=
k8
through znver3
, or silvermont
-family, or sandybridge
or newer.
(Also note that some choices which should differ based on -mtune
setting actually don't. GCC just has one way it always does some things, and adding hooks to make it differ based on a tuning flag hasn't been done. Clang is the same way. e.g. -mtune=core2
still doesn't know to avoid partial-register stalls!)
Clang normally lives dangerously writing partial registers and otherwise ignoring false dependencies when they're not visibly loop-carried within a single function (which can bite it in the ass). This can save a whole instruction when it skips xor-zeroing, but saving just 1 byte doesn't seem worth it in general. It's a false dependency and means the mov load decodes to load + ALU merge uops (to merge a new low byte into the existing 64-bit register).
Looks like clang just did its usual thing of loading 8-bit values into 8-bit registers ignoring movzx
, then realized at the end it needed to zero-extend the result.
An optimization pass looking for a chance to fold zero-extension (after narrow math) into an earlier load would be useful. And/or otherwise look for ways to prove that values are already zero-extended, if it doesn't do that.
Probably in general better to start doing narrow loads with movzx
so that's more normally the case.
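For example (a hand-written sketch, not actual clang output), even when nothing ever needs the value zero-extended, starting with a movzx load is the cheaper habit on modern CPUs:
# byte-register load: merges into the old RAX on HSW+ / AMD, a false dependency,
# even though only AL is ever read afterwards:
mov al, byte ptr [rdi]
cmp al, 42
# movzx load: a pure load uop, independent of the old RAX; the byte compare is unchanged
# and the only cost is 1 byte of code size:
movzx eax, byte ptr [rdi]
cmp al, 42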
You might want to report a missed-optimization bug, especially for clang. Their code-gen is already a huge middle finger toward P6-family most of the time with partial-register usage, so they'd probably be interested in trying to generate the 2-instruction version. https://github.com/llvm/llvm-project/issues
Also https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (use the keyword missed-optimization for GCC bugs. Feel free to link this stack overflow post, and/or quote any of my comments if you want, as well as a Godbolt link. GCC devs prefer AT&T syntax for x86 discussion / bugs.)
See also my Q&A about how partial registers perform on HSW/SKL: Agner's guide claims e.g. zero latency for movzx eax, dl, and that AH-merging was free, but Agner's guide is accurate AFAIK only for earlier CPUs. Also note that xor eax,eax sets some kind of internal EAX=AL flag.
Regarding the question's other examples ("I have 2 more examples here: https://godbolt.org/z/z45xr4hq1 GCC generates both sensible and strange movzx, and Clang's use of mov and movzx just makes no sense to me"):
clang's mov ecx, edx
zero-extension from 32 to 64 instead of from 8 to 64 is because it depends on an unofficial extension to the x86-64 SysV calling convention, that narrow args are extended to 32-bit. AMD Zen CPUs can do mov-elimination on mov ecx, edx
but not for movzx
from a byte register, so this is actually more efficient, as well as saving code-size.
(GCC and clang both make callers that respect this unofficial calling-convention feature, but only clang makes callees that depend on it. ICC doesn't do either so is not ABI-compatible with clang.)
Extension to intptr_t
is of course necessary for all narrower args if you're going to index an array with one. (In abstract C terms, this is just part of using the value for pointer math). High garbage is allowed in at least the high 32 bits of the 64-bit register.
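A hand-written sketch of what relying on that looks like (not actual compiler output; the take_byte function, the registers, and the array-indexing use are made up for illustration):
# in the caller, for a uint8_t argument (GCC and clang callers both do this part):
movzx esi, byte ptr [rdx]            # arg zero-extended to 32 bits, not just loaded into SIL
                                     # (then set up RDI and call take_byte)
# clang-style callee, e.g. for: int take_byte(const int *t, uint8_t i) { return t[i]; }
take_byte:
mov eax, esi                         # 32->64 zero-extension, relying on the caller's 8->32 extension;
                                     # mov-eliminable on AMD Zen, unlike movzx eax, sil
mov eax, dword ptr [rdi + rax*4]     # the pointer math needs the index at full register width
ret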
Upvotes: 53
Reputation: 179819
The clang bit actually seems reasonable. You get a partial register stall if you write to al
and then read from eax
. Using movzx
avoids this partial-register stall.
The initial mov to al
has no dependencies on existing values of eax
(due to register renaming), so the only dependencies are the unavoidable ones (wait for [rsi], wait for [rdi], wait for the add to complete before zero-extending).
In other words, the top 24 bits must be zeroed and the lower 8 bits must be calculated, but the two actions can be done in either order. clang just chooses to add first, zero later.
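Concretely, the two orderings (hand-written sketches, not compiler output):
# zero the register first, then compute in the low byte:
xor eax, eax
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
# compute in the low byte first, then zero the upper bits (clang's choice):
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al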
[EDIT]
As for GCC, its choice seems particularly bad. If it had chosen bl
as the temporary register, that last movzx
would be zero-latency on Haswell/Skylake, but mov-elimination does not work when the source and destination are the same register (al to eax).
Upvotes: 8