Reputation: 39784
I have the following code:
char swap(char reg, char* mem) {
std::swap(reg, *mem);
return reg;
}
I expected this to compile down to:
swap(char, char*):
xchg dil, byte ptr [rsi]
mov al, dil
ret
But what it actually compiles to is (at -O3 -march=haswell -std=c++20
):
swap(char, char*):
mov al, byte ptr [rsi]
mov byte ptr [rsi], dil
ret
See here for a live demo.
From the documentation of xchg
, the first form should be perfectly possible:
XCHG - Exchange Register/Memory with Register
Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose registers or a register and a memory location.
So is there any particular reason why it's not possible for the compiler to use xchg
here? I have tried other examples too, such as swapping pointers, swapping three operands, swapping types other than char
but I never get an xchg
in the compile output. How come?
Upvotes: 1
Views: 978
Reputation: 365277
TL:DR: because compilers optimize for speed, not for names that sound similar. There are lots of other terrible ways they also could have implemented it, but chose not to.
xchg with mem has an implicit lock
prefix (on 386 and later) so it's horribly slow. You always want to avoid it unless you need an atomic exchange, or are optimizing completely for code-size without caring at all for performance, in cases where you do want the result in the same register as the original value. Sometimes seen in naive (performance oblivious) or code-golfed hand-written Bubble Sort as part of swapping 2 memory locations.
Possibly clang -Oz
could go that crazy, IDK, but hopefully wouldn't in this case because your xchg way is larger code size, needing a REX prefix on both instructions to access DIL, vs. the 2-mov way being a 2-byte and a 3-byte instruction. clang -Oz
does do stuff like push 1
/ pop rax
instead of mov eax, 1
to save 2 bytes of code size.
GCC -Os
won't use xchg
for swaps that don't need to be atomic because -Os
still cares some about speed.
Also, IDK why would you think xchg + dependent mov would be faster or a better choice than two independent mov
instructions that can run in parallel. (The store buffer makes sure that the store is correctly ordered after the load, regardless of which uop finds its execution port free first).
See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
Seriously, I just don't see any plausible reason why you'd think a compiler might want to use xchg
, especially given that the calling convention doesn't pass an arg in RAX so you still need 2 instructions. Even for registers, xchg reg,reg
on Intel CPUs is 3 uops, and they're microcode uops that can't benefit from mov-elimination. (Some AMD CPUs have 2-uop xchg reg,reg
. Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?)
I also guess you're looking at clang output; GCC will avoid partial register shenanigans (like false dependencies) by using a movzx eax, byte ptr [rsi]
load even though the return value is only the low byte. Zero-extending loads are cheaper than merging into the old value of RAX. So that's another downside to xchg
.
Upvotes: 9
Reputation: 9366
So is there any particular reason why it's not possible for the compiler to use xchg here?
Because mov
is faster than xchg
and compilers optimize for speed.
See:
Upvotes: 4