Reputation:
This question is about x86 assembly but I provide an example in C because I tried to check what GCC was doing.
As I was following various assembly guides, I have noticed that people, at least the few whose materials I have been reading, seem to be in a habit of allocating stack variables closer to rsp than rbp.
I then checked what GCC would do and it seems to be the same.
In the disassembly below, first 0x10 bytes are reserved and then the result of calling leaf goes via eax to rbp-0xc and the constant value 2 goes to rbp-0x8, leaving room between rbp-0x8 and rbp for variable "q".
I could imagine doing it in the other direction, first assigning to an address at rbp and then at rbp-0x4, i.e. doing it in the direction of rbp to rsp, then leaving some space between rbp-0x8 and rsp for "q".
What I am not sure about is whether what I am observing is required by some architectural constraint that I had better be aware of and adhere to, or whether it is purely an artifact of this particular implementation and of the habits of the people whose code I read, which I should not assign any significance to. For example, maybe this needs to be done in one direction or the other and it does not matter which one, as long as it is consistent.
Or perhaps I am just reading and writing trivial code for now and this will go both ways as I get to something more substantial in some time?
I would just like to know how I should go about it in my own assembly code.
All of this is on Linux 64-bit, GCC version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04). Thanks.
00000000000005fa <leaf>:
5fa: 55 push rbp
5fb: 48 89 e5 mov rbp,rsp
5fe: b8 01 00 00 00 mov eax,0x1
603: 5d pop rbp
604: c3 ret
0000000000000605 <myfunc>:
605: 55 push rbp
606: 48 89 e5 mov rbp,rsp
609: 48 83 ec 10 sub rsp,0x10
60d: b8 00 00 00 00 mov eax,0x0
612: e8 e3 ff ff ff call 5fa <leaf>
617: 89 45 f4 mov DWORD PTR [rbp-0xc],eax ; // <--- This line
61a: c7 45 f8 02 00 00 00 mov DWORD PTR [rbp-0x8],0x2 ; // <-- And this too
621: 8b 55 f4 mov edx,DWORD PTR [rbp-0xc]
624: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
627: 01 d0 add eax,edx
629: 89 45 fc mov DWORD PTR [rbp-0x4],eax
62c: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
62f: c9 leave
630: c3 ret
Here is the C code:
int leaf() {
    return 1;
}
int myfunc() {
    int x = leaf(); // <--- This line
    int y = 2; // <-- And this too
    int q = x + y;
    return q;
}
int main(int argc, char *argv[]) {
    return myfunc();
}
How I compile it:
gcc -O0 main.c -o main.bin
How I disassemble it:
objdump -d -j .text -M intel main.bin
Upvotes: 2
Views: 785
Reputation: 365950
It makes zero difference; do whichever you want for local variables that have to exist in memory at all (because you can't optimize them into registers).
There is zero significance to what GCC is doing; it doesn't matter where the unused gap is (which exists because of stack alignment). In this case it's the 4 bytes at [rsp], aka [rbp - 0x10].
The 4 bytes at [rbp - 4] are used for q.
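As a rough picture of the layout GCC chose here, the 16 bytes reserved by sub rsp,0x10 can be modelled as a C struct whose base is at rbp-0x10. The struct and the helper are hypothetical names made up for illustration; nothing like them exists in the compiled code, and the offsets are simply read off the disassembly in the question.

```c
#include <stddef.h>

/* Hypothetical model of myfunc's 16-byte frame, lowest address first.
   The struct starts at rbp-0x10, which equals rsp after `sub rsp,0x10`. */
struct myfunc_frame {
    int pad; /* rbp-0x10: unused, only there for stack alignment */
    int x;   /* rbp-0xc  */
    int y;   /* rbp-0x8  */
    int q;   /* rbp-0x4  */
};

/* Convert an offset inside the struct into a displacement from rbp. */
static long rbp_displacement(size_t offset_in_frame) {
    return (long)offset_in_frame - (long)sizeof(struct myfunc_frame);
}
```

Reordering the members (for example moving pad to the end) gives an equally valid frame; only the displacements in the addressing modes change.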
Also, you didn't tell GCC to optimize, so there's no reason to expect its choices to even be optimal or a useful guide to learn from. -O3 with volatile int locals would make more sense. (But since there's nothing significant going on, still not actually helpful.)
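A minimal sketch of that suggestion, reusing the function names from the question: volatile forces every read and write of the locals to go through memory, so they still get stack slots with optimization enabled. Which slot each one gets is up to the compiler, so treat any layout you see as one possible choice.

```c
/* Sketch: same logic as the question's code, with volatile locals so
   the stores and loads survive optimization. */
int leaf(void) {
    return 1;
}

int myfunc(void) {
    volatile int x = leaf(); /* must be stored to memory */
    volatile int y = 2;      /* must be stored to memory */
    volatile int q = x + y;  /* reloads x and y, then stores q */
    return q;                /* reloads q */
}
```

Something like gcc -O3 -S -masm=intel on a file containing this (the file name is up to you) should show the resulting stack accesses.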
The things that matter:
Local vars should be naturally aligned (dword values at least 4-byte aligned). The C ABI requires this: alignof(int) = 4. RSP before a call will be 16-byte aligned, so on function entry RSP-8 is 16-byte aligned.
Code size: As many as possible of your addressing modes can use small (signed 8-bit) displacements (see Footnote 1) from RBP (or RSP if you address your locals relative to RSP like gcc -fomit-frame-pointer).
This is trivially the case when you only have a few scalar locals, nowhere near 128 bytes of them.
Any locals you can operate on together are adjacent, and preferably not crossing an alignment boundary, so you can most efficiently init them both / all with one qword or XMM store.
If you have a lot of locals (or an array), group them for spatial locality if there's one whole cache line that might be "cold" while this function (and its children) are running.
Spatial locality: variables you use earlier in your function should be higher in the stack frame (closer to the return address which was stored by the call to this function). The stack is typically hot in cache, but touching a new cache line of stack memory as it grows will be slightly less of an impact if it's done after earlier loads/stores. Out-of-order exec can hopefully get to those later store instructions soon and get that cache-miss store into the pipeline to start an RFO (read for ownership) early, minimizing time spent with earlier loads clogging up the store buffer.
This only matters across boundaries wider than 16 bytes; you know everything within one 16-byte aligned chunk is in the same cache line.
A descending access pattern within one cache line might possibly trigger prefetch of the next cache line downward, but I'm not sure if that happens in real CPUs. If so, that might be a reason not to do this, and to favour storing first to the bottom of your stack frame (at RSP, or the lowest red-zone address you'll actually use).
If there's unused space for stack alignment before another call, it's usually only 8 bytes at most. That's much smaller than a cache line and thus doesn't have any significant impact on spatial locality of your local variables. You know the stack pointer alignment relative to a 16-byte boundary, so the choice of leaving padding at the top or bottom of your stack frame never makes a difference between potentially touching a new cache line or not.
If you're passing pointers to your locals to different threads, beware false sharing: probably separate those locals by at least 64 bytes so they'll be in different cache lines, or even better by 128 bytes (L2 spatial prefetcher can create "destructive interference" between adjacent cache lines).
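The alignment and false-sharing points above can be sketched in C11. This assumes a 64-byte cache line and a compiler that supports over-aligned automatic variables (GCC does); SEPARATION, struct slot and both function names are hypothetical, chosen only for this example.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Use 128 to also keep the slots out of
   the same adjacent-line pair. */
#define SEPARATION 64

/* One slot per thread; alignas also pads sizeof(struct slot) to SEPARATION. */
struct slot {
    alignas(SEPARATION) int value;
};

/* Nonzero if p is a multiple of n bytes. */
static int is_aligned(const void *p, size_t n) {
    return (uintptr_t)p % n == 0;
}

/* Returns 1 if two neighbouring stack slots are naturally aligned and
   fall into different SEPARATION-sized blocks of memory. */
int slots_are_isolated(void) {
    struct slot slots[2] = {{1}, {2}};
    uintptr_t a = (uintptr_t)&slots[0].value;
    uintptr_t b = (uintptr_t)&slots[1].value;
    return is_aligned(&slots[0].value, alignof(int))
        && a / SEPARATION != b / SEPARATION;
}
```

In hand-written asm the equivalent is to align RSP yourself (for example with and rsp, -64 after saving the old value) and space the per-thread locals 64 or 128 bytes apart.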
Footnote 1: x86 sign-extended 8-bit vs. sign-extended 32-bit displacements in addressing modes like [rsp + disp8] are why the x86-64 System V ABI chose a 128-byte red-zone below RSP: it gives at most a ~256-byte area that can be accessed with more compact code-size, including the red-zone plus reserved space above RSP.
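To make the disp8 range concrete, here is a small sketch with two hypothetical helpers: one checks whether a displacement fits the one-byte encoding, the other sign-extends an encoded byte the way the CPU does. The example bytes come from the disassembly in the question, where 89 45 f4 is mov DWORD PTR [rbp-0xc],eax.

```c
#include <stdint.h>

/* Nonzero if disp fits the one-byte, sign-extended displacement
   encoding, i.e. lies in [-128, 127]. Otherwise a 4-byte disp32 is needed. */
static int fits_disp8(long disp) {
    return disp >= INT8_MIN && disp <= INT8_MAX;
}

/* Sign-extend an encoded disp8 byte, e.g. the f4 in `89 45 f4`. */
static int decode_disp8(uint8_t encoded) {
    return encoded >= 0x80 ? (int)encoded - 0x100 : (int)encoded;
}
```

So relative to RSP, the 128 bytes of red zone (rsp-128 up to rsp-1) plus rsp+0 up to rsp+127 are all reachable with the short encoding.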
PS:
Note that you don't have to use the same memory location for the same high-level "variable" at every point in your function. You could spill/reload something to one location in one part of a function, and another location later in the function. IDK why you would, but if you have wasted space for alignment it's something you could do. Possibly if you expect one cache line to be hot early on (e.g. near the top of the stack frame on function entry), and another cache line to be hot later (near some other vars that were being used heavily).
A "variable" is a high-level concept you can implement however you like. This isn't C, there's no requirement that it have an address, or have the same address. (C compilers in practice will optimize variables into registers if the address isn't taken, or doesn't escape the function after inlining.)
This is kind of off-topic or at least a pedantic diversion; normally you do simply use the same memory location for the same thing consistently, when it can't be in a register.
Upvotes: 2