Reputation: 2257
I am currently using this 64-bit MASM code to call a C runtime function such as memcmp(). I recall this convention came from a GoAsm article on optimizations.
memcmp PROTO;:QWORD,:QWORD,:QWORD
PUSH RSP
PUSH QWORD PTR [RSP]
AND SPL,0F0h
MOV R8,R11
MOV RDX,R10
MOV RCX,RAX
SUB RSP,32
CALL memcmp
LEA RSP,[RSP+40]
POP RSP
Is the version below a valid optimization?
memcmp PROTO;:QWORD,:QWORD,:QWORD
PUSH RSP
PUSH QWORD PTR [RSP]
AND RSP,-16 ; new
MOV R8,R11
MOV RDX,R10
MOV RCX,RAX
LEA RSP,[RSP-32] ; new
CALL memcmp
LEA RSP,[RSP+40]
POP RSP
The justification for replacing
AND SPL,0F0h
with
AND RSP,-16
is that it avoids partial register updates (see the Q&A: Understanding fastcall stack frame).
The justification for replacing
SUB RSP,32
with
LEA RSP,[RSP-32]
is that the ensuing instructions do not depend on the flags updated by the subtraction, so not updating the flags should be more efficient as well (see the Q&A: Why does GCC emit "lea" instead of "sub" for subtraction?).
In this case, are there other optimization tricks too?
Upvotes: 0
Views: 320
Reputation: 364210
AND: yes, the original code was silly and not saving any code-size (SPL takes a REX prefix, too, like 64-bit operand-size).
LEA: pointless and a waste of code-size. x86 CPUs already avoid false dependencies on FLAGS via register renaming; that's necessary to efficiently run normal x86 code, which is full of instructions like add, sub, and, etc. Compilers would use lea much more heavily if that wasn't the case. The answer on that linked Q&A is wrong and should be downvoted / deleted. The only danger is on a few less-common CPUs (Pentium 4 and Silvermont, for different reasons) from instructions like inc that only write some flags (see: INC instruction vs ADD 1: Does it matter?). Even the cost of inc on Silvermont-family is pretty minor: just an extra uop, but not during decode, so it doesn't stall.
add is not slower than lea on any CPUs, either itself or in its influence on later instructions. (Except in-order Atom pre-Silvermont, where lea ran earlier in the pipeline than add (on an actual AGU), so it could be better or worse depending on where data was coming from / going to.) You'd only use lea in some cases like an adc loop where you actually need to keep CF unchanged so the next iteration can read it, i.e. to not mess up a true dependency (RAW); that has nothing to do with avoiding a false (WAW) output dependency. (See: Problems with ADC/SBB and INC/DEC in tight loops on some CPUs. Note that cases where adc / inc / adc creates a partial-flag stall are cases where add would cause a correctness problem, so I'm not counting that as a case where add would make later instructions faster.)
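To illustrate the one case above where lea genuinely matters, here is a minimal sketch of a multi-limb add loop. The procedure name add_n and its register assignment (RCX = destination, RDX = source, R8 = non-zero qword count) are assumptions made for this example, not anything from the question. The pointer increments use lea because the next adc has to read the CF produced by the previous one; an add there would overwrite CF and give wrong results.

```asm
; Hypothetical leaf routine: dst[i] += src[i] with carry propagation.
; Assumed arguments: RCX = dst, RDX = src, R8 = number of qwords (must be > 0).
; Returns the final carry-out (0 or 1) in EAX.
.code
add_n PROC
    xor   eax, eax          ; also clears CF before the first adc
next_limb:
    mov   rax, [rdx]        ; mov does not modify flags
    adc   [rcx], rax        ; consumes CF from the previous iteration
    lea   rdx, [rdx+8]      ; advance pointers without writing CF
    lea   rcx, [rcx+8]
    dec   r8                ; dec writes ZF but leaves CF unchanged
    jnz   next_limb
    setc  al                ; capture the final carry
    movzx eax, al
    ret
add_n ENDP
END
```

The dec / jnz pair is the kind of partial-flag usage the linked Q&A discusses, so on some older CPUs a different loop-counter idiom may perform better; the sketch only aims to show why CF must survive the pointer updates.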
You probably don't need to save the old RSP; the ABI requires 16-byte stack alignment before a call, and that includes your caller (unless you're getting called from code that doesn't follow the ABI, so you don't have known RSP alignment relative to a 16-byte boundary).
Normally you'd just do sub rsp, 40 like a compiler would, to realign RSP and reserve the shadow space. (And you'd do this at the top/bottom of the function, not around every call, along with saving/restoring call-preserved registers.)
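As a rough sketch of that compiler-style approach, assuming the caller follows the Windows x64 ABI (so RSP is 8 bytes off a 16-byte boundary on entry, because of the pushed return address): the procedure name compare_bufs is hypothetical, and unwind directives are left out to keep it short.

```asm
; Hypothetical wrapper: compare_bufs(a, b, n) returns memcmp(a, b, n).
; Assumes an ABI-conforming caller; unwind info (PROC FRAME etc.) omitted.
EXTERN memcmp:PROC

.code
compare_bufs PROC
    sub   rsp, 40           ; 32 bytes of shadow space + 8 to realign RSP to 16
    ; RCX, RDX and R8 already hold the three arguments, so pass them through
    call  memcmp
    add   rsp, 40           ; release the frame once, at the bottom
    ret
compare_bufs ENDP
END
```

With the question's register usage, the three MOVs into R8, RDX and RCX would simply sit between the sub and the call; no saved copy of RSP is needed.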
(In practice memcmp is unlikely to care about stack alignment, unless it needs to save/restore some more XMM regs. The Windows x64 calling convention unwisely only has 6 call-clobbered x/ymm registers, and that might be slightly tight depending on how much loop unrolling they do in a hand-written(?) memcmp.)
And even if you did need to handle an unknown incoming RSP alignment, saving RSP to two different locations for pop rsp is still not a very efficient way to go about it. Normally you'd just use RBP to make a traditional frame pointer to clean up with mov rsp, rbp / pop rbp, which works regardless of unknown adjustment to RSP, e.g. even in functions that use alloca (or in asm, that do an unknown number of pushes or a variable-sized sub rsp, which is effectively the same thing as and rsp, -16).
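A minimal sketch of the frame-pointer variant, for the case where the incoming RSP alignment really is unknown. The procedure name compare_unaligned is made up for illustration, and unwind directives are again omitted.

```asm
; Hypothetical entry point that may be reached with arbitrary RSP alignment.
; Arguments are assumed to be in RCX, RDX, R8 already.
EXTERN memcmp:PROC

.code
compare_unaligned PROC
    push  rbp               ; RBP is call-preserved, so save the caller's value
    mov   rbp, rsp          ; remember the RSP we must return to
    and   rsp, -16          ; round RSP down to a 16-byte boundary
    sub   rsp, 32           ; shadow space; RSP stays 16-byte aligned
    call  memcmp
    mov   rsp, rbp          ; undoes both the and and the sub, whatever they did
    pop   rbp
    ret
compare_unaligned ENDP
END
```

Only one copy of the old RSP is kept (in RBP), and the cleanup does not depend on how many bytes the and actually removed.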
Upvotes: 4