Significance of laying out stack variables starting nearer rsp than rbp

Question

This question is about x86 assembly but I provide an example in C because I tried to check what GCC was doing.

As I was following various assembly guides, I noticed that people, at least the few whose materials I have been reading, seem to be in the habit of allocating stack variables closer to rsp than rbp.

I then checked what GCC would do and it seems to be the same.

In the disassembly below, first 0x10 bytes are reserved and then the result of calling leaf goes via eax to rbp-0xc and the constant value 2 goes to rbp-0x8, leaving room between rbp-0x8 and rbp for variable "q".

I could imagine doing it in the other direction, first assigning to an address at rbp and then at rbp-0x4, i.e. doing it in the direction of rbp to rsp, then leaving some space between rbp-0x8 and rsp for "q".

What I am not sure about is whether what I am observing is the way things should be because of some architectural constraint that I had better be aware of and adhere to, or whether it is purely an artifact of this particular implementation and a habit of the people whose code I read, to which I should not assign any significance. For example, perhaps this has to be done in one direction or the other, and it does not matter which one as long as it is consistent.

Or perhaps I am just reading and writing trivial code for now and this will go both ways as I get to something more substantial in some time?

I would just like to know how I should go about it in my own assembly code.

All of this is on Linux 64-bit, GCC version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04). Thanks.

00000000000005fa <leaf>:
 5fa:   55                      push   rbp
 5fb:   48 89 e5                mov    rbp,rsp
 5fe:   b8 01 00 00 00          mov    eax,0x1
 603:   5d                      pop    rbp
 604:   c3                      ret    

0000000000000605 <myfunc>:
 605:   55                      push   rbp
 606:   48 89 e5                mov    rbp,rsp
 609:   48 83 ec 10             sub    rsp,0x10
 60d:   b8 00 00 00 00          mov    eax,0x0
 612:   e8 e3 ff ff ff          call   5fa <leaf>
 617:   89 45 f4                mov    DWORD PTR [rbp-0xc],eax   ; // <--- This line
 61a:   c7 45 f8 02 00 00 00    mov    DWORD PTR [rbp-0x8],0x2   ; // <--  And this too
 621:   8b 55 f4                mov    edx,DWORD PTR [rbp-0xc]
 624:   8b 45 f8                mov    eax,DWORD PTR [rbp-0x8]
 627:   01 d0                   add    eax,edx
 629:   89 45 fc                mov    DWORD PTR [rbp-0x4],eax
 62c:   8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
 62f:   c9                      leave  
 630:   c3                      ret

Here is the C code:

int leaf() {
   return 1;
}

int myfunc() {
   int x = leaf(); // <--- This line
   int y = 2;      // <--  And this too
   int q = x + y;
   return q;
}

int main(int argc, char *argv[]) {
   return myfunc();
}

How I compile it:

gcc -O0 main.c -o main.bin

How I disassemble it:

objdump -d -j .text -M intel main.bin

Peter Cordes · Accepted Answer

It makes zero difference; do whichever you want for local variables that have to exist at all (because you can't optimize them into registers).

There is zero significance to what GCC is doing; it doesn't matter where the unused gap is (which exists because of stack alignment). In this case it's the 4 bytes at [rsp], aka [rbp - 0x10].
The 4 bytes at [rbp - 4] are used for q.
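
To see where the 4 unused bytes come from, here is a minimal C sketch of the arithmetic. It assumes 4-byte int locals and a frame size rounded up to a multiple of 16 bytes; `round_up_16` and `frame_padding` are hypothetical helper names made up for illustration, not anything GCC exposes.

```c
#include <assert.h>
#include <stddef.h>

/* Round a byte count up to the next multiple of 16 (the stack alignment). */
static size_t round_up_16(size_t n) {
    return (n + 15u) & ~(size_t)15u;
}

/* Bytes of padding left over after reserving room for `locals` bytes. */
static size_t frame_padding(size_t locals) {
    size_t frame = round_up_16(locals);
    assert(frame >= locals);   /* holds unless locals + 15 overflowed */
    return frame - locals;
}
```

For the three 4-byte ints in myfunc, round_up_16(12) gives 16, which matches sub rsp,0x10, and frame_padding(12) gives 4, which is the unused dword at [rbp - 0x10].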

Also, you didn't tell GCC to optimize, so there's no reason to expect its choices to even be optimal or a useful guide to learn from. -O3 with volatile int locals would make more sense. (But since there's nothing significant going on, still not actually helpful.)

The things that matter:

Local vars should be naturally aligned (dword values at least 4-byte aligned). The C ABI requires this: alignof(int) = 4. RSP before a call will be 16-byte aligned, so on function entry RSP-8 is 16-byte aligned.
Code size: As many as possible of your addressing modes can use small (signed 8-bit) displacements¹ from RBP (or RSP if you address your locals relative to RSP like gcc -fomit-frame-pointer).

This is trivially the case when you only have a few scalar locals, nowhere near 128 bytes of them.
Adjacency: locals you can operate on together should be adjacent, and preferably not cross an alignment boundary, so you can most efficiently init both (or all) of them with one qword or XMM store.

If you have a lot of locals (or an array), group them for spatial locality if there's one whole cache line that might be "cold" while this function (and its children) are running.
Spatial locality: variables you use earlier in your function should be higher in the stack frame (closer to the return address which was stored by the call to this function). The stack is typically hot in cache, but touching a new cache line of stack memory as it grows will be slightly less of an impact if it's done after earlier loads/stores. Out-of-order exec can hopefully get to those later store instructions soon and get that cache-miss store into the pipeline to start an RFO (read for ownership) early, minimizing time spent with earlier loads clogging up the store buffer.

This only matters across boundaries wider than 16 bytes; you know everything within one 16-byte aligned chunk is in the same cache line.

A descending access pattern within one cache line might possibly trigger prefetch of the next cache line downward, but I'm not sure if that happens in real CPUs. If so, that might be a reason not to do this, and to favour storing first to the bottom of your stack frame (at RSP, or the lowest red-zone address you'll actually use).

If there's unused space for stack alignment before another call, it's usually only 8 bytes at most. That's much smaller than a cache line and thus doesn't have any significant impact on spatial locality of your local variables. You know the stack pointer alignment relative to a 16-byte boundary, so the choice of leaving padding at the top or bottom of your stack frame never makes a difference between potentially touching a new cache line or not.
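
The claim that a 16-byte aligned chunk never straddles a cache line can be checked with a little arithmetic. This is only a sketch, assuming 64-byte cache lines (typical for x86-64, but an assumption here); `line_of` and `chunk_in_one_line` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u   /* assumed line size */

/* Index of the cache line containing byte address `addr`. */
static uint64_t line_of(uint64_t addr) {
    return addr / CACHE_LINE;
}

/* True if the 16-byte chunk starting at 16-byte-aligned `base`
   lies entirely inside one cache line. */
static bool chunk_in_one_line(uint64_t base) {
    assert(base % 16 == 0);
    return line_of(base) == line_of(base + 15);
}
```

Because 64 is a multiple of 16, chunk_in_one_line holds for every aligned base, so moving padding around inside such a chunk cannot change which lines get touched. An unaligned 16-byte span can cross a boundary: one starting at address 56 covers lines 0 and 1.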

If you're passing pointers to your locals to different threads, beware false sharing: probably separate those locals by at least 64 bytes so they'll be in different cache lines, or even better by 128 bytes (L2 spatial prefetcher can create "destructive interference" between adjacent cache lines).
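
In C, one way to get that separation is to over-align the per-thread data. A minimal sketch, assuming C11 and a 64-byte line size; `per_thread_slot` and `same_line` are hypothetical names, and CACHE_LINE could be set to 128 for the more conservative spacing.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed; 128 also avoids adjacent-line prefetch effects */

/* One slot per thread; alignas pads the struct out to a full cache line. */
struct per_thread_slot {
    alignas(CACHE_LINE) long value;
};

static_assert(sizeof(struct per_thread_slot) == CACHE_LINE,
              "each slot should occupy exactly one cache line");

/* True if two addresses fall in the same cache line. */
static bool same_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

With a local array such as struct per_thread_slot slots[2], same_line(&slots[0].value, &slots[1].value) is false, so two threads writing to their own slot do not contend for one line. In hand-written assembly, the equivalent is aligning the stack pointer and spacing the locals by the same amount.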

Footnote 1: x86 sign-extended 8-bit vs. sign-extended 32-bit displacements in addressing modes like [rsp + disp8] are why the x86-64 System V ABI chose a 128-byte red-zone below RSP: it gives at most a ~256-byte area that can be accessed with more compact code, including the red-zone plus reserved space above RSP.
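
The displacement ranges in the footnote can be sketched in C. The helper names are hypothetical, and the model is simplified: it ignores encodings that can omit a zero displacement entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `disp` is encodable as a sign-extended 8-bit displacement. */
static bool fits_disp8(int32_t disp) {
    return disp >= INT8_MIN && disp <= INT8_MAX;
}

/* Displacement bytes for [base + disp]: 1 for disp8, otherwise 4 for disp32.
   Simplification: a zero displacement is still counted as 1 byte. */
static int disp_bytes(int32_t disp) {
    return fits_disp8(disp) ? 1 : 4;
}

/* Number of byte offsets reachable with a disp8: -128 through +127. */
static int disp8_window(void) {
    return INT8_MAX - INT8_MIN + 1;
}
```

In the listing in the question, [rbp-0xc] is encoded with the single displacement byte f4, whereas an offset such as -0x90 would need four bytes. The whole 128-byte red zone (offsets -128 through -1 from RSP) fits in the disp8 range, which is where the roughly 256-byte window comes from.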

PS:

Note that you don't have to use the same memory location for the same high-level "variable" at every point in your function. You could spill/reload something to one location in one part of a function, and another location later in the function. IDK why you would, but if you have wasted space for alignment it's something you could do. Possibly if you expect one cache line to be hot early on (e.g. near the top of the stack frame on function entry), and another cache line to be hot later (near some other vars that were being used heavily).

A "variable" is a high-level concept you can implement however you like. This isn't C; there's no requirement that it have an address, or that it keep the same address. (C compilers in practice will optimize variables into registers if the address isn't taken, or doesn't escape the function after inlining.)

This is kind of off-topic or at least a pedantic diversion; normally you do simply use the same memory location for the same thing consistently, when it can't be in a register.
