Reputation: 946
I know this question has been asked many times, but I couldn't find a clear answer, so I'm sorry, but I still have this doubt.
When programming in assembly language for personal projects, not necessarily linked to external libraries, but still wanting to comply with written or unwritten conventions, which registers should I use, and in what order?
I thought about using the same registers used for function arguments (RDI, RSI, RDX, RCX, R8, R9) to store values if I need them. Also, if I need more, I'd move on to R10-R15, but I'm not sure whether this is customary practice.
Also, what are the criteria for choosing the stack over registers?
PS. Sorry again for asking a question that has been asked on the site a few times already, but I just want to do it "right". Thanks.
Upvotes: 1
Views: 373
Reputation: 26656
Also, what are the criteria for choosing the stack over registers?
There are a few cases. The obvious one is where you need to pass a parameter and that parameter must be passed on the stack (because you're out of parameter registers).
Further, as we analyze variables (temporaries included), we may determine which ones are live across a call: defined before a call and used after a call. For variables that are not live across a call, we favor call-clobbered registers.
Variables that are live across a call require storage that survives a function call, and that storage can be either a call-preserved register or stack-based memory, since both meet the criterion of surviving a function call.
If a variable that is live across a call has a low dynamic use count, stack memory is probably favored; with a high dynamic use count, a call-preserved register is probably favored.
This is due to the overhead associated with using call-preserved registers: their incoming value must be preserved, usually in the function prologue, and that same value restored, usually in the function epilogue.
If the dynamic count of the references (either definitions or uses) to that variable is low, say, 2 (one definition and one use), then stack memory may be slightly more efficient than using a call-preserved register.
First, an example with a high dynamic use count:
int sum = 0;
for ( int i = 0; i < len; i++ ) {
    sum += f(i);
}
Here we may estimate that sum has one definition (int sum = 0;) plus one read and one write (sum += ...;) per iteration of the loop. Because we might assume that the loop executes several iterations, repeated access to a register would be an improvement over access to memory, so a call-preserved register is indicated. By using a call-preserved register, we effectively move the memory read/write operations into the prologue/epilogue that preserve that register, as compared with using memory directly within the loop.
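For concreteness, here is a minimal sketch of how that might look in x86-64 assembly (NASM syntax, System V calling convention), assuming len arrives in EDI and f is some external function taking an int:

sum_loop:
    push rbx                ; preserve caller's RBX (cost paid once, in the prologue)
    push rbp                ; preserve caller's RBP
    push r12                ; preserve caller's R12
    xor  ebx, ebx           ; sum = 0, kept in a call-preserved register
    xor  ebp, ebp           ; i = 0, also live across the call
    mov  r12d, edi          ; len, also live across the call
.loop:
    cmp  ebp, r12d
    jge  .done
    mov  edi, ebp           ; pass i as the argument
    call f                  ; clobbers RAX, RCX, RDX, RSI, RDI, R8-R11
    add  ebx, eax           ; sum += f(i): a register access inside the loop
    inc  ebp                ; i++
    jmp  .loop
.done:
    mov  eax, ebx           ; return sum
    pop  r12
    pop  rbp
    pop  rbx
    ret

The memory traffic for sum (and for i and len) is confined to the pushes and pops outside the loop.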
In another example, we have the recursive Fibonacci (wildly inefficient, but used here for pedagogical purposes) as follows:
int fib(int n) {
    if (n<=1) return n;
    return fib(n-1) + fib(n-2);
}
Here there are two variables of particular concern: one is n, and the other is an unnamed temporary that holds the intermediate result from one of the recursive calls.
n, as a parameter, is effectively defined upon function entry, and it is used three times (dynamically counting, on the recursive path), but only one of those usages is actually live across a call. So we would naturally use n from the parameter register, but, perhaps in the prologue or perhaps just before the first recursive invocation, we would move n to call-surviving storage. Due to the low use count (defined before a call and used once after a call), stack-based storage is appropriate for this variable (as long as we use it from the incoming parameter register for as long as we can).
The temporary holding the return value from the first recursive call also has a low use count: it is defined before the second recursive call and used after it (by the addition operator). Thus, it is also a better candidate for stack-based memory than for a call-preserved register; the overhead of preserving a register would be slightly higher than using stack memory directly.
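A minimal sketch of that in x86-64 assembly (NASM syntax, System V calling convention) could look like the following, with n and the intermediate result spilled to stack slots rather than to call-preserved registers:

fib:
    cmp  edi, 1
    jg   .recurse
    mov  eax, edi           ; base case: return n
    ret
.recurse:
    sub  rsp, 24            ; one allocation covers both slots (and keeps
                            ;   the stack 16-byte aligned for the calls)
    mov  [rsp], edi         ; spill n: it is live across the first call
    lea  edi, [rdi-1]       ; argument n-1
    call fib
    mov  [rsp+4], eax       ; spill fib(n-1): live across the second call
    mov  edi, [rsp]         ; reload n
    sub  edi, 2             ; argument n-2
    call fib
    add  eax, [rsp+4]       ; fib(n-1) + fib(n-2)
    add  rsp, 24
    ret

Each spilled value is written once and read once, which is exactly the low-use-count situation where a stack slot beats paying the save/restore cost of a call-preserved register. Note also that both slots are allocated with a single sub rsp.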
Let's also note that the return address is a good candidate for stack-based memory, for two reasons: that's how the x86 call/ret instructions work, and the usage scenario is that the return address is often live across some other call yet needed only once (dynamically counting), at the end of the function. However, other instruction sets provide the return address in a register, which makes it like a register parameter that needs the same live-across-a-call analysis.
There are lots of caveats. Using stack-based memory requires stack space to be allocated; however, several stack slots can be allocated in one go, somewhat mitigating the allocation/deallocation overhead (not to mention push & pop). Further, there can be situations where the overhead for a call-preserved register has already been paid, yet the register happens to be free at the point, and for the purpose, being considered. Many other issues factor into the analysis depending on the ISA (e.g. x86-64 vs. RISC-V), such as the number of registers categorized as call-preserved.
Upvotes: 1
Reputation: 365247
In any calling convention, use the call-clobbered registers before call-preserved regs, except for values that you want to survive across a function call. My answer on that linked Q&A covers a lot of what you're asking about how / when to use registers.
For x86-64 on OSes other than Windows, see What registers are preserved through a linux x86-64 function call for calling-convention details like which registers are call-clobbered. (R12-R15 are not; they're call-preserved.)
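As a tiny sketch (NASM syntax, System V calling convention, with a hypothetical helper function), the usual pattern is: scratch values live in call-clobbered registers, and only a value that must survive a call justifies saving and restoring a call-preserved register:

compute:
    push rbx             ; we want one value to survive the call below
    mov  ebx, edi        ; the incoming argument is needed again after the call
    lea  edi, [rdi+1]    ; scratch work can stay in call-clobbered registers
    call helper          ; hypothetical callee; may clobber RAX, RCX, RDX, RSI, RDI, R8-R11
    add  eax, ebx        ; the saved value is still in RBX
    pop  rbx
    ret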
Within the call-clobbered regs on x86-64, they're all equivalent for most purposes, although many instructions have smaller encodings for AL or EAX with an 8-bit or 32-bit immediate respectively. (And smaller machine-code size is generally better, all else equal, for better I-cache density and front-end decode throughput.)
Note that add eax, 4 is shortest using the standard 3-byte add r/m32, imm8 encoding, not the 5-byte add eax, imm32. AL for 8-bit immediate operations always saves space, but EAX for 32- or 64-bit operations only saves space for constants that don't fit in an imm8, or for test eax, immediate because there's no test r/m, imm8 encoding. But of course you can always use test al, 1 instead of test eax, 1 if you only want the low bit; the only thing you lose out on is things like test eax, -128 to check all but the low 7 bits with sign-extension of a negative number.
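To make the size differences concrete, here are the byte counts for a few of those cases (NASM picks the shortest legal encoding by default):

add  eax, 4          ; 83 C0 04            = 3 bytes (add r/m32, imm8)
add  eax, 1000       ; 05 E8 03 00 00      = 5 bytes (add eax, imm32 short form)
add  ecx, 1000       ; 81 C1 E8 03 00 00   = 6 bytes (add r/m32, imm32)
add  al, 4           ; 04 04               = 2 bytes (add al, imm8 short form)
test eax, 1          ; A9 01 00 00 00      = 5 bytes (no imm8 form exists for test)
test al, 1           ; A8 01               = 2 bytes (test al, imm8)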
See also Why are RBP and RSP called general-purpose registers? for details on the extra code size for some addressing modes involving certain registers (RBP, R12, and R13 as the base).
That RBP/RSP Q&A also mentions the fact that most of the "legacy" registers (not R8-R15) have some special instruction that uses them implicitly, like RCX (specifically CL) being the only register usable for variable shift counts, as in shr edx, cl, unless you have BMI2 shrx edx, eax, esi (which has larger code size but is more efficient on Intel, being a single uop).
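To give a concrete feel for both points, a few NASM lines with their encodings:

mov  eax, [rax]      ; 8B 00               = 2 bytes
mov  eax, [rbp]      ; 8B 45 00            = 3 bytes (RBP as base forces a disp8 of 0)
mov  eax, [r13]      ; 41 8B 45 00         = 4 bytes (REX prefix plus the forced disp8)
mov  eax, [r12]      ; 41 8B 04 24         = 4 bytes (R12 as base forces a SIB byte)
shr  edx, cl         ; the count for a variable shift must be in CL
shrx edx, eax, esi   ; BMI2: the count can come from any register (ESI here)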
Another case where all else is not equal: Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? Even on Skylake, the adc al, 0 short-form encoding is 2 uops for no apparent reason; this was only fixed in Alder Lake.
Upvotes: 2