Reputation: 390
In x86 assembly, is it possible to remove a value from the stack without storing it? Something along the lines of pop word null
? I could obviously use add esp,4
, but maybe there's a nice and clean CISC mnemonic I'm missing?
Upvotes: 13
Views: 7654
Reputation: 364180
add esp,4
/ add rsp,8
is the normal / idiomatic / clean way. No special way is needed because stacks aren't magical or special (at least not in this respect); it's just a pointer in a register with some instructions that use it implicitly. (And for kernel stacks, interrupts use it asynchronously so software couldn't implement a kernel red-zone even if it wanted to...)
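As a minimal sketch of the idiomatic form, here is a 32-bit caller-pops call that discards its one stack arg afterwards. It assumes NASM syntax, and `consume` is a hypothetical cdecl function that is not part of the question; stack alignment is ignored for brevity.

```nasm
bits 32
extern consume              ; hypothetical: void consume(int), cdecl

global caller
caller:
    push    dword 42        ; one 4-byte stack arg
    call    consume
    add     esp, 4          ; discard the arg: pure pointer arithmetic, nothing is loaded
    ret
```

Unlike `pop`, the `add` writes FLAGS and touches no other register.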
Other than that, the magical CISC way to clean up a whole stack frame at the end of a function is leave
= mov esp, ebp
/ pop ebp
(or the 16 or 64-bit equivalent). Unlike enter
, it's fast enough on modern CPUs to be usable in practice, but it's still a 3-uop instruction on Intel CPUs (http://agner.org/optimize/). But leave
only works if you spent extra instructions making a stack frame with ebp
/ rbp
in the first place. (Usually you wouldn't do that, unless you need to reserve a variable amount of stack space, e.g. with push
in a loop to make an array, or the equivalent of a C99 VLA or alloca
. Or for beginner code to make access to locals easier, or in 16-bit mode where SP
can't be used in addressing modes.)
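A sketch of the case where a frame pointer earns its keep: the number of pushes is only known at run time, so the function can't clean up with a constant `add esp, N`. This assumes NASM syntax and a cdecl-style stack arg; `reserve_and_release` is a hypothetical name used only for illustration.

```nasm
bits 32
global reserve_and_release
; hypothetical: void reserve_and_release(unsigned n)
reserve_and_release:
    push    ebp
    mov     ebp, esp            ; establish the frame pointer
    mov     ecx, [ebp+8]        ; n = first stack arg
    test    ecx, ecx
    jz      .done
.push_loop:
    push    ecx                 ; run-time number of pushes: ESP's offset is not a constant
    dec     ecx
    jnz     .push_loop
.done:
    leave                       ; mov esp, ebp / pop ebp: drops everything pushed above
    ret
```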
The magical CISC way to clean up stack-args is for the callee to use ret imm16
(costing 1 extra uop) to pop the args, creating a calling convention where the callee cleans the stack. In a caller-pops calling convention, there's no way to use this form of ret
, but you can simply leave the stack offset and use mov
to store args for the next function call instead of push
(if the function needs any stack-args at all; register-arg calling conventions are generally more efficient.)
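Both styles might look roughly like this. It is a sketch in NASM syntax; `add2_calleepops`, `call_twice` and `f` are hypothetical names, and stack alignment requirements are ignored.

```nasm
bits 32
; Callee-pops (stdcall-style): int add2_calleepops(int a, int b)
add2_calleepops:
    mov     eax, [esp+4]        ; a
    add     eax, [esp+8]        ; + b
    ret     8                   ; pop the return address, then add 8 to ESP to drop both args

; Caller-pops: reserve the arg slot once and store into it with mov
extern f                        ; hypothetical cdecl function taking one int
call_twice:
    sub     esp, 4              ; reserve one arg slot
    mov     dword [esp], 1
    call    f
    mov     dword [esp], 2      ; reuse the same slot; ESP is not adjusted between calls
    call    f
    add     esp, 4              ; single cleanup at the end
    ret
```

The slot is rewritten before the second call because a cdecl callee is allowed to modify its own stack args.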
So the magic CISC ways have no performance advantage on modern CPUs, only minor code-size.
There are 2 reasons you might use pop reg
instead of add esp,4
:
Code size: pop r32/r64
is a one-byte instruction, vs. 3 bytes for add esp,4
or 4 bytes for add rsp,8
.
Performance: Intel's stack engine has to insert extra stack-sync uops when you use esp
/ rsp
explicitly after a stack instruction (push/pop/call/ret). So after a call
(which returns with a ret
), it saves a uop to use pop
instead of add esp,4
before you ret
at the end of the function.
AMD's stack engine doesn't need extra stack-sync uops, but still makes push/pop single-uop instructions. On older Intel/AMD CPUs without a stack engine, push/pop cost more than plain mov
loads/stores: they needed a separate uop for the stack-pointer modification, which also created a data dependency on the stack pointer.
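The pattern described above, sketched for x86-64 in NASM syntax. `helper` is a hypothetical external function, and the choice of RCX as the dead register assumes it is call-clobbered in the calling convention in use (it is in both x86-64 System V and Windows x64).

```nasm
bits 64
extern helper                   ; hypothetical function
global wrapper
wrapper:
    push    rax                 ; 1 byte (50): dummy push to realign RSP by 8 before the call
    call    helper
    pop     rcx                 ; 1 byte (59): drop the slot into a dead register
                                ; alternative: add rsp, 8 is 4 bytes (48 83 C4 08)
    ret                         ; RAX (the return value) is left untouched
```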
See Why does this function push RAX to the stack as the first operation? for more details about performance.
If you were looking for aesthetics, well, you can indent, format, and comment your code nicely, but beyond that, you chose the wrong language when you picked x86 asm if aesthetics outweigh optimization.
Of course, if you need to adjust the stack by more than 1 register-width, definitely use add
if you don't need the data that pop
would load. Or, if you need to adjust it by +128 bytes, use sub esp, -128
, because -128
is encodable as a sign-extended imm8, but +128 isn't.
Or maybe use lea esp, [esp+4]
, like gcc does with -mtune=atom
. (For in-order Atom, not Silvermont.) Like I said, if you wanted clean, you shouldn't have picked x86 asm.
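For reference, these are the 32-bit encodings behind the size claims. The sketch assumes NASM syntax and a NASM build that picks the short imm8 form by default, which recent versions do.

```nasm
bits 32
    add     esp, 4              ; 83 C4 04            (3 bytes)
    add     esp, 127            ; 83 C4 7F            (3 bytes, largest positive imm8)
    add     esp, 128            ; 81 C4 80 00 00 00   (6 bytes, needs imm32)
    sub     esp, -128           ; 83 EC 80            (3 bytes, -128 fits in imm8)
    lea     esp, [esp+4]        ; 8D 64 24 04         (4 bytes, leaves FLAGS unmodified)
```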
You can almost always find a dead register to pop
into. If you need to adjust E/RSP by one stack slot before popping some registers you actually wanted to pop, you can always pop the same register twice.
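A sketch of popping the same register twice, in NASM syntax. `func` is a hypothetical function that saves ESI and EBX and then takes one scratch slot with a dummy push.

```nasm
bits 32
func:
    push    esi                 ; save call-preserved registers
    push    ebx
    push    eax                 ; scratch slot (same effect on ESP as sub esp, 4)
    ; ... body that uses [esp] as scratch space ...
    pop     ebx                 ; discard the scratch slot; the loaded value is overwritten next
    pop     ebx                 ; restore the real saved EBX
    pop     esi
    ret
```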
In the extremely rare case where none of the 7 (x86-32) or 15 (x86-64) non-stack registers are available as pop
destinations, this optimization is not available and you should simply use the traditional add
. It's not worth spending extra instructions to make it possible to pop
; that would outweigh the minor benefit of using pop
.
Note that pop Sreg
(segment register) still consumes the regular "stack width" (32 or 64 bits, depending on mode), rather than only 16 for a 16-bit register. But only pop ds/es/ss
are single-byte, and those exist only in 16/32-bit mode (they're invalid in 64-bit mode). pop fs/gs
are 2 bytes each. So if you're optimizing for code-size, pop gs
is 1 byte smaller than add esp,4
, but much much slower. (Or 2 bytes smaller than add rsp,8
).
Upvotes: 18