Why XOR before SETcc?

Question

This fragment of code

int foo(int a, int b)
{ 
    return (a == b);
}

generates the following assembly (https://godbolt.org/z/fWsM1zo6q)

foo(int, int):
        xorl    %eax, %eax
        cmpl    %esi, %edi
        sete    %al
        ret

According to https://www.felixcloutier.com/x86/setcc

[SETcc] Sets the destination operand to 0 or 1 depending on the settings of the status flags

So what is the point of initializing %eax with zero by doing xorl %eax, %eax first if it will be zero/one depending on result of a == b anyway? Isn't it a waste of CPU clocks that both gcc and clang can't avoid for some reason?

Peter Cordes · Accepted Answer

Because setcc sucks: only available in 8-bit operand-size. But you used 32-bit int for the return value, so you need that 8-bit result zero-extended to 32-bit.
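
To see why the upper bits matter, here is a minimal C sketch that models EAX as a plain uint32_t. The helper names (model_sete_al, model_xor_cmp_sete) are made up for this illustration; they only mimic which bits get written, not any timing behaviour.

```c
#include <stdint.h>

/* Hypothetical model of `sete %al`: only bits 0..7 of EAX are replaced,
   bits 8..31 keep whatever was there before. */
uint32_t model_sete_al(uint32_t eax, int zf)
{
    return (eax & 0xFFFFFF00u) | (zf ? 1u : 0u);
}

/* Hypothetical model of: xorl %eax,%eax ; cmpl %esi,%edi ; sete %al */
uint32_t model_xor_cmp_sete(uint32_t stale_eax, int a, int b)
{
    uint32_t eax = stale_eax;
    eax ^= eax;               /* xor-zeroing: EAX = 0 whatever it held */
    int zf = (a == b);        /* cmp sets ZF when the operands are equal */
    return model_sete_al(eax, zf);
}
```

With a stale value such as 0xDEADBEEF in EAX, model_sete_al alone yields 0xDEADBE01 for an equal compare, which is not a valid int 0/1. The xor-zeroed sequence yields 1.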

Even if you did only want to return a bool or char, you might still do this to avoid a false dependency when writing AL. xor-zeroing doesn't cost "a cycle", it costs 1 uop (and is as cheap as a nop on Intel), but that's still not free. (https://agner.org/optimize/)

Unfortunately AMD64 didn't change setcc, nor did any later extensions, so producing a 32-bit 0/1 is still a pain on x86 even with -march=icelake-client or znver3. Having a 66 operand-size or rep prefix modify setcc to use 32-bit operand-size would have been helpful to avoid wasting an instruction (and front-end uop) for this, but neither vendor has ever bothered to introduce an extension like that. (Vendors usually only add extensions that can give major speedups in a few "hot" functions that you can do dynamic dispatch for, not things that need to be used everywhere to add up to a small improvement.)

xor-zeroing before the setcc is the least-bad way, when you have a spare register, as discussed at the bottom of my answer on What is the best way to set a register to zero in x86 assembly: xor, mov or and?.

The other options, if you do want to overwrite a compare input, include:

1. mov-imm32=0 which you can do after a compare, not affecting FLAGS:

# for example if you want to replace a compare input with a boolean
    cmp    %ecx, %eax
    mov    $0, %eax
    setcc  %al

This wastes code size (5 bytes for mov $0, %eax vs. 2 for xor %eax, %eax), and (on Intel P6-family) has a partial register stall when reading EAX, because no xor-zeroing was used to set the internal RAX=AL upper-bytes-known-zero state.

The mov-immediate is off the critical path, so out-of-order exec can get it done early, before the compare inputs are ready, and have that zeroed register ready for setcc to write into.

(On Intel SnB-family CPUs, xor-zeroing is handled in the rename logic, so it doesn't have to execute early to get the zero ready; it's already done when it enters the back-end. e.g. after a front-end stall, xor-zeroing and setcc could enter the back-end in the same cycle, but the setcc could still execute in the first cycle after that, unlike if it was a mov-immediate that would have to actually run on a back-end execution unit to write a zero to a register.)
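
The reason option 1 uses mov rather than xor after the compare is that xor writes FLAGS while mov does not. The toy model below is only a sketch of that ordering constraint: struct toy_cpu and all the toy_* / eq_with_* names are invented for illustration, and only ZF is modeled.

```c
#include <stdint.h>

/* Hypothetical toy CPU state: one register and the zero flag. */
struct toy_cpu {
    uint32_t eax;
    int zf;
};

/* cmp: sets ZF when the two operands are equal. */
void toy_cmp(struct toy_cpu *c, uint32_t a, uint32_t b) { c->zf = (a == b); }

/* xor %eax,%eax writes FLAGS: the result is zero, so ZF becomes 1. */
void toy_xor_zero(struct toy_cpu *c) { c->eax = 0; c->zf = 1; }

/* mov $imm32,%eax leaves FLAGS untouched. */
void toy_mov_imm(struct toy_cpu *c, uint32_t imm) { c->eax = imm; }

/* sete %al replaces only the low byte of EAX. */
void toy_sete_al(struct toy_cpu *c)
{
    c->eax = (c->eax & 0xFFFFFF00u) | (c->zf ? 1u : 0u);
}

/* cmp %ecx,%eax ; mov $0,%eax ; sete %al */
uint32_t eq_with_mov_after_cmp(uint32_t eax, uint32_t ecx)
{
    struct toy_cpu c = { eax, 0 };
    toy_cmp(&c, c.eax, ecx);
    toy_mov_imm(&c, 0);       /* ZF from the compare survives */
    toy_sete_al(&c);
    return c.eax;
}

/* cmp %ecx,%eax ; xor %eax,%eax ; sete %al  (broken: xor clobbers ZF) */
uint32_t eq_with_xor_after_cmp(uint32_t eax, uint32_t ecx)
{
    struct toy_cpu c = { eax, 0 };
    toy_cmp(&c, c.eax, ecx);
    toy_xor_zero(&c);         /* overwrites the compare result in ZF */
    toy_sete_al(&c);
    return c.eax;
}
```

In this model eq_with_xor_after_cmp returns 1 for any inputs, which is why xor-zeroing has to go before the cmp, and why it needs a register that is not one of the compare inputs.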

2. MOVZX on an 8-bit setcc result

    cmp    %ecx, %eax
    setcc  %cl
    movzbl %cl, %eax

This is mostly even worse, except on P6-family where it avoids a partial-register stall.

But movzx is on the critical path from compare inputs being ready to 0/1 result being ready. (Although IvyBridge and later can run it with zero latency when it's between two separate registers, which is why I used %cl instead of %al. Compilers normally don't optimize for this, and would emit setcc %al / movzbl %al, %eax if they don't manage to xor-zero something first. Using the same register for source and destination like that defeats mov-elimination even on Intel CPUs that have it.)

setcc %cl has a false dependency on RCX (except on Intel P6-family which renames low8 registers separately from the full register), but that's ok because RCX and RAX were both already part of the dependency chain leading to setcc.
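
As a rough sketch of the data flow in option 2, the C model below treats registers as uint32_t values. The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Hypothetical model of `sete %cl`: only the low byte of ECX changes. */
uint32_t model_sete_cl(uint32_t ecx, int zf)
{
    return (ecx & 0xFFFFFF00u) | (zf ? 1u : 0u);
}

/* Hypothetical model of `movzbl %cl, %eax`: take the low byte, zero-extend. */
uint32_t model_movzbl(uint32_t src)
{
    return (uint32_t)(uint8_t)src;
}

/* cmp %ecx,%eax ; sete %cl ; movzbl %cl,%eax */
uint32_t eq_with_movzx(uint32_t eax, uint32_t ecx)
{
    int zf = (eax == ecx);          /* cmp */
    ecx = model_sete_cl(ecx, zf);   /* upper bytes of ECX are still stale */
    return model_movzbl(ecx);       /* result depends on the setcc output */
}
```

The return value is computed from the setcc result, which mirrors the point above: the zero-extension sits between the compare and the final 0/1, instead of being done ahead of time as with xor-zeroing.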

If you're not overwriting one of the compare inputs, xor-zero the separate destination register. setcc %al / movzbl %al, %eax after cmp %esi, %edi would be the worst of all possible options, because RAX might have last been written by a cache-miss load of something independent, or a slow div or something like that before the function call, so you could be coupling this dependency chain into it.

Update: APX will fix this

Intel's APX extension (planned for Granite Rapids, see https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html for the manual) will have a version of setcc that zeros the upper part of the register, so something like setcc rax.

Unfortunately the good version needs a 4-byte EVEX prefix; with just a 2-byte REX2 you can only get the existing bad semantics of setcc.

Still, it totally solves the problem with similar machine-code size to xor+setcc, and more importantly no need for any register setup ahead of the FLAGS-setting instruction, so compilers don't have to work at it.

So maybe in another 2 decades or more if/when APX becomes baseline for most builds (outside of JIT and -march=native builds that only need to run on the current machine), we can finally be mostly free of this wart inherited from 386, which AMD64 declined to fix. That is, if AArch64 or RISC-V haven't replaced x86-64 by then. (AArch64 has a very nice csinc conditional select-increment instruction, which can be used with its zero-register to materialize a 0 or 1 in a single instruction, or to conditionally increment something directly.)
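
For readers who don't know csinc, here is a small C model of its semantics (Wd = cond ? Wn : Wm + 1). The function names are invented for this sketch, and the zero register is modeled as the constant 0.

```c
#include <stdint.h>

/* Hypothetical model of AArch64 `csinc Wd, Wn, Wm, cond`:
   Wd = cond ? Wn : Wm + 1 */
uint32_t model_csinc(int cond, uint32_t wn, uint32_t wm)
{
    return cond ? wn : wm + 1u;
}

/* 0/1 for a == b with the zero register as both sources.
   The condition is inverted (ne) so that equality selects 0 + 1. */
uint32_t model_eq_via_csinc(uint32_t a, uint32_t b)
{
    return model_csinc(a != b, 0u, 0u);
}

/* Conditionally increment x when a == b: same register as both sources. */
uint32_t model_inc_if_eq(uint32_t x, uint32_t a, uint32_t b)
{
    return model_csinc(a != b, x, x);
}
```

The full destination register is written in both cases, so there is nothing like the x86 need to zero the upper bits separately.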
