Reputation: 533
I know that movzx
can be used for dependency breaking, but I stumbled on some movzx
uses by both Clang and GCC that I can't see any good reason for. Here's a simple example I tried on the Godbolt compiler explorer:
#include <stdint.h>
int add2bytes(uint8_t* a, uint8_t* b) {
return uint8_t(*a + *b);
}
with GCC 12 -O3
:
add2bytes(unsigned char*, unsigned char*):
movzx eax, BYTE PTR [rsi]
add al, BYTE PTR [rdi]
movzx eax, al
ret
If I understand correctly, the first movzx
here breaks the dependency on the previous eax
value, but what is the second movzx
doing? I don't think there's any dependency it can break, and it shouldn't affect the result either.
with clang 14 -O3
, it's even weirder:
add2bytes(unsigned char*, unsigned char*): # @add2bytes(unsigned char*, unsigned char*)
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al
ret
It uses mov
where movzx
seems more reasonable, and then zero-extends al
to eax
, but wouldn't it be much better to do movzx
at the start?
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1
GCC generates both sensible and strange movzx
, and Clang's use of mov r8 m
and movzx
just makes no sense to me. I also tried adding -march=skylake
to make sure this isn't a feature for really old architectures, but the generated assembly looks more or less the same.
The closest post I have found is https://stackoverflow.com/a/64915219/14730360 where they showed similar movzx
uses that seem useless and/or out of place.
Do the compilers really use movzx
poorly here, or am I missing something?
Edit: I have opened bug reports for Clang and GCC:
https://github.com/llvm/llvm-project/issues/56498
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277
Temporary workarounds using inline assembly: https://godbolt.org/z/7qob8G3j7
/* Byte-sized add via inline asm: "+r"(a) is a read-write register operand,
   "mi"(b) allows a memory or immediate source, and the %b0 modifier selects
   the low-byte name of operand 0 (e.g. al for eax). */
#define addb(a, b) asm (\
"addb %1, %b0"\
: "+r"(a) : "mi"(b))
int add2bytes(uint8_t* a, uint8_t* b) {
int ret = *a;
addb(ret, *b);
return ret;
}
Now Clang -O3 produces:
add2bytes(unsigned char*, unsigned char*): # @add2bytes(unsigned char*, unsigned char*)
movzx eax, byte ptr [rdi]
add al, byte ptr [rsi]
ret
Upvotes: 53
Views: 3235
Reputation: 364210
Both compilers are doing a poor job here, but clang's code is especially bad and has no real upside anywhere. And an easily avoidable downside on everything except Intel CPUs a decade old (which rename low-8 partial registers).
The optimal asm is what you suggest, movzx
load, then byte add, leaving a uint8_t
result in the low byte, correctly zero-extended to int
as required by the C semantics. (Thanks for reporting it upstream: https://github.com/llvm/llvm-project/issues/56498 - I commented there about movzx
being a good idea for byte loads in general, even when LLVM doesn't need the result zero-extended.)
A movzx
is necessary somewhere, but it can be in the initial load. (A movzx
is generally a good idea for a byte load anyway, to avoid a false dependency on the old RAX; clang's choice to save 1 byte is probably not a good one even when it doesn't end up needing a separate movzx
right after.)
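For this function, that ideal sequence is just two instructions plus the ret; a hand-written sketch (it matches what the question's inline-asm workaround gets clang to emit):
movzx eax, byte ptr [rdi]    # zero-extending byte load: a pure load uop, no false dep on the old RAX
add al, byte ptr [rsi]       # byte add; the upper bits of EAX are already zero
ret                          # so the int return value is correctly zero-extended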
There are basically three relevant behaviours here, among x86-64 CPUs (see the annotated sketch after these three cases).
Core 2 / Nehalem (the 64-bit capable members of the P6 family): AL renamed separately from RAX if you write AL. A later read of EAX will stall the front-end for about 3 cycles while inserting a merge uop. Less bad than earlier P6-family, but still a significant penalty to avoid. But these CPUs are pretty obsolete, and not something GCC's -mtune=generic
should put much weight on for the latest GCC. (Especially given that current nightly GCC's behaviour won't get baked into widely used binary packages by most stable-release distros for probably another year or more.)
Returning an int
when the last instruction wrote al
will likely lead to a penalty when the caller reads EAX. But mov al, [rdi]
can run without any false dependency or merging cost.
Sandybridge and maybe Ivy Bridge: AL still renamed separately, but a merging uop can be inserted without any stalling, in a cycle with other uops.
mov al, [rdi]
still has no false dep or merging uop. But a later read of EAX that triggers a merging uop (to merge the add al
result with the high bytes of RAX from movzx eax, [rdi]
) will get inserted just as cheaply as if we'd put a movzx eax, al
in the machine code. (If the upper bytes of RAX are all zero, merge or extend are equivalent.)
Haswell and later (and maybe IvB), and all other x86 vendors, and low-power CPUs from Intel like Silvermont-family: no partial register renaming at all. (Except for AH/BH/CH/DH on Intel SnB-family). The last CPU not in this category is nearly a decade old, and the last CPU with major penalties (P6-family) is over a decade old.
mov al, [rdi]
sucks: false dependency and costs an ALU uop in the back-end to merge. So it's extra load latency in the critical path through whatever stored the memory operand.
Reading EAX after writing AL has zero penalty; that's not a special case at all; the merging happened when you wrote AL.
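Putting those three cases together against clang's code for this function (my annotations, only restating the behaviour described above, not compiler output):
mov al, byte ptr [rsi]       # HSW+ / AMD / Silvermont: load + ALU uop merging into the old RAX, i.e. a false dependency
                             # Core 2 / Nehalem / SnB: AL renamed separately, so no false dep or merge cost here
add al, byte ptr [rdi]       # byte ALU either way
movzx eax, al                # HSW+: a real ALU uop, not mov-eliminated (same register)
                             # SnB: same cost as the merge uop a later EAX read would insert anyway
                             # Core 2 / Nehalem: avoids the ~3-cycle stall the caller would get reading EAX
ret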
GCC's code is a sensible tradeoff between Core2 / Nehalem vs. modern CPUs: load with movzx
to avoid a false dep writing a partial reg. And a final movzx
to avoid a partial-register stall in the caller.
But if it's going to do that, it could hurt modern Intel less by picking EDX or ECX as the temporary, since Intel can do zero-latency mov
-elimination on movzx r32, r8
, but not within the same register. It still costs a front-end uop so it's not free for throughput, only latency and back-end ports. This is a persistent missed-optimization; I don't think GCC or clang know to look for that; they commonly zero-extend 32->64 with mov esi,esi
on a function arg, for example.
movzx edx, byte ptr [rdi]
add dl, [rsi]
movzx eax, dl # mov-elimination possible on IvB and later (except Ice Lake with updated microcode which breaks mov-elim).
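As an aside, the same-register limitation applies to plain mov as well as movzx; a two-instruction sketch (register choices are arbitrary, not compiler output):
mov esi, esi                 # same-register 32->64 zero-extension: never mov-eliminated, costs a real ALU uop
mov eax, esi                 # different registers: eligible for zero-latency mov-elimination on IvB+ and AMD Zen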
If optimizing specifically for Core2 / Nehalem, you'd do this:
xor eax, eax # off the critical path, avoids partial-reg stalls for later reads of EAX after writing AL
mov al, [rdi]
add al, [rsi]
That's not bad on later CPUs, although the mov al, [rdi]
would still be a micro-fused load+ALU uop so it has extra load latency, and takes an extra slot in the scheduler and a cycle on a back-end execution port. So 3 back-end uops, up from 2 in IvB and later with eliminated movzx
if you pick different registers.
GCC's choice to use movzx
because of Core2/Nehalem is highly conservative at this point; probably -mtune=generic
in GCC12 shouldn't care about P6-family partial-register stalls since those CPUs are well over a decade old. Especially in 64-bit code where the worst case is Core2/Nehalem, not the even longer stalls with no merging uop on earlier P6-family. (And 64-bit code is more likely to be run on newer CPUs; one of the use-cases for -m32
is to make code for old 32-bit-only CPUs.)
It might well be an intentional tuning choice that needs updating. It's definitely a missed optimization with -march
/ -mtune=
k8
through znver3
, or silvermont
-family, or sandybridge
or newer.
(Also note that some choices which should differ based on -mtune
setting actually don't. GCC just has one way it always does some things, and adding hooks to make it differ based on a tuning flag hasn't been done. Clang is the same way. e.g. -mtune=core2
still doesn't know to avoid partial-register stalls!)
Clang normally lives dangerously writing partial registers and otherwise ignoring false dependencies when they're not visibly loop-carried within a single function (which can bite it in the ass). This can save a whole instruction when it skips xor-zeroing, but saving just 1 byte doesn't seem worth it in general. It's a false dependency and means the mov load decodes to load + ALU merge uops (to merge a new low byte into the existing 64-bit register).
Looks like clang just did its usual thing of loading 8-bit values into 8-bit registers ignoring movzx
, then realized at the end it needed to zero-extend the result.
An optimization pass looking for a chance to fold zero-extension (after narrow math) into an earlier load would be useful. And/or otherwise look for ways to prove that values are already zero-extended, if it doesn't do that.
Probably in general better to start doing narrow loads with movzx
so that's more normally the case.
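For example (a hand-written sketch, not actual clang output), even when nothing ever needs the value zero-extended, starting with a movzx load is the cheaper habit on modern CPUs:
# byte-register load: merges into the old RAX on HSW+ / AMD, a false dependency,
# even though only AL is ever read afterwards:
mov al, byte ptr [rdi]
cmp al, 42
# movzx load: a pure load uop, independent of the old RAX; the byte compare is unchanged
# and the only cost is 1 byte of code size:
movzx eax, byte ptr [rdi]
cmp al, 42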
You might want to report a missed-optimization bug, especially for clang. Their code-gen is already a huge middle finger toward P6-family most of the time with partial-register usage, so they'd probably be interested in trying to generate the 2-instruction version. https://github.com/llvm/llvm-project/issues
Also https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (use the keyword missed-optimization for GCC bugs. Feel free to link this stack overflow post, and/or quote any of my comments if you want, as well as a Godbolt link. GCC devs prefer AT&T syntax for x86 discussion / bugs.)
See also my Q&A about how partial registers perform on HSW/SKL: Agner's guide claims e.g. zero latency for movzx eax, dl, and that AH-merging was free, but Agner's guide is accurate AFAIK only for earlier CPUs. Also note that xor eax,eax sets some kind of internal EAX=AL flag.
Regarding the question's other examples ("I have 2 more examples here: https://godbolt.org/z/z45xr4hq1 GCC generates both sensible and strange movzx, and Clang's use of mov and movzx just makes no sense to me"):
clang's mov ecx, edx
zero-extension from 32 to 64 instead of from 8 to 64 is because it depends on an unofficial extension to the x86-64 SysV calling convention, that narrow args are extended to 32-bit. AMD Zen CPUs can do mov-elimination on mov ecx, edx
but not for movzx
from a byte register, so this is actually more efficient, as well as saving code-size.
(GCC and clang both make callers that respect this unofficial calling-convention feature, but only clang makes callees that depend on it. ICC doesn't do either so is not ABI-compatible with clang.)
Extension to intptr_t
is of course necessary for all narrower args if you're going to index an array with one. (In abstract C terms, this is just part of using the value for pointer math). High garbage is allowed in at least the high 32 bits of the 64-bit register.
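A hand-written sketch of what relying on that looks like (not actual compiler output; the take_byte function, the registers, and the array-indexing use are made up for illustration):
# in the caller, for a uint8_t argument (GCC and clang callers both do this part):
movzx esi, byte ptr [rdx]            # arg zero-extended to 32 bits, not just loaded into SIL
                                     # (then set up RDI and call take_byte)
# clang-style callee, e.g. for: int take_byte(const int *t, uint8_t i) { return t[i]; }
take_byte:
mov eax, esi                         # 32->64 zero-extension, relying on the caller's 8->32 extension;
                                     # mov-eliminable on AMD Zen, unlike movzx eax, sil
mov eax, dword ptr [rdi + rax*4]     # the pointer math needs the index at full register width
ret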
Upvotes: 53
Reputation: 179819
The clang bit actually seems reasonable. You get a partial register stall if you write to al
and then read from eax
. Using movzx
avoids this partial-register stall.
The initial mov to al
has no dependencies on existing values of eax
(due to register renaming), so the only dependencies are the unavoidable ones (wait for [rsi], wait for [rdi], wait for the add to complete before zero-extending).
In other words, the top 24 bits must be zeroed and the lower 8 bits must be calculated, but the two actions can be done in either order. clang just chooses to add first, zero later.
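Concretely, the two orderings (hand-written sketches, not compiler output):
# zero the register first, then compute in the low byte:
xor eax, eax
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
# compute in the low byte first, then zero the upper bits (clang's choice):
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al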
[EDIT]
As for GCC, its choice seems particularly bad. If it had chosen bl
as the temporary register, that last movzx
would be zero-latency on Haswell/Skylake, but mov-elimination does not work when the source and destination are the same register (al to eax).
Upvotes: 8