Reputation: 12175
To learn assembly, I am viewing the assembly generated by GCC with the -S option for some simple C programs. I have an add function which accepts some ints and some chars and adds them together. I am just wondering why the char parameters are pushed onto the stack as 8 bytes (pushq)? Why not just push a single byte?
.file "test.c"
.text
.globl add
.type add, @function
add:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movl %ecx, -16(%rbp)
movl %r8d, -20(%rbp)
movl %r9d, -24(%rbp)
movl 16(%rbp), %ecx
movl 24(%rbp), %edx
movl 32(%rbp), %eax
movb %cl, -28(%rbp)
movb %dl, -32(%rbp)
movb %al, -36(%rbp)
movl -4(%rbp), %edx
movl -8(%rbp), %eax
addl %eax, %edx
movl -12(%rbp), %eax
addl %eax, %edx
movl -16(%rbp), %eax
addl %eax, %edx
movl -20(%rbp), %eax
addl %eax, %edx
movl -24(%rbp), %eax
addl %eax, %edx
movsbl -28(%rbp), %eax
addl %eax, %edx
movsbl -32(%rbp), %eax
addl %eax, %edx
movsbl -36(%rbp), %eax
addl %edx, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size add, .-add
.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq $9
pushq $8
pushq $7
movl $6, %r9d
movl $5, %r8d
movl $4, %ecx
movl $3, %edx
movl $2, %esi
movl $1, %edi
call add
addq $24, %rsp
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu 4.9.2-10ubuntu13) 4.9.2"
.section .note.GNU-stack,"",@progbits
#include <stdio.h>
int add(int a, int b, int c, int d, int e, int f, char g, char h, char i)
{
return a + b + c + d + e + f + g + h + i;
}
int main()
{
return add(1, 2, 3, 4, 5, 6, 7, 8, 9);
}
Upvotes: 5
Views: 847
Reputation: 364068
GCC is following the x86-64 System V ABI, which was designed to use stack slots the same width as full registers, like earlier 16 and 32-bit x86 calling conventions, and like typical ABIs for other machines.
This has a few advantages, like making it possible to use push fairly simply for stack args, even with a mix of constants and variables. It is also more robust in some cases when code is sloppy about arg types for variadic functions (like printf) or unprototyped functions (e.g. implicit function declarations, or K&R-style declarations like int add(); without an arg list). Either of those was maybe still a consideration in 2000, when the x86-64 SysV ABI was designed publicly on a mailing list whose archives give some insight into stuff like the choice of those six arg-passing registers. Sloppy C and C++ was more common then than it hopefully is now.
x86-64 System V is an LP64 ABI (long and pointers are 64-bit), so code that passes a long to a function expecting an int (variadic like printf("%d %d", sizeof(foo), my_long), or no prototype) would break later args in a hypothetical calling convention where 64-bit long took more space than 32-bit int.
(Such a design would be possible, with stack args laid out following the struct rules above RSP, so each is aligned by its alignof(). Valid ISO C programs without UB would still work. POSIX's numbered printf stuff like printf("%2$Lf %1$d\n", 1, (long double)2.2) at first looks like it needs more functionality, but each arg must be mentioned by the format string so a possible implementation is to just retrieve or index them all in sequence. Even x86-64 System V doesn't make an array of fixed-width args that you can easily index.)
An implicit declaration of your add (e.g. if main was in a separate file without a prototype) would also break in a narrow-arg-packing convention even if you passed the last 3 args as (char)7.
C default argument promotion for variadic and unprototyped functions will expand char and other narrow integer types to int, which actually still works on 2's complement machines like x86 even if the callee is expecting char (footnote 1), if the calling convention pads narrow args to at least int width. (ISO C doesn't define the behaviour, but unless -flto sees the mismatch, the calling convention effectively does define the behaviour.)
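As a minimal sketch of default argument promotion in well-defined C, the variadic function below receives char arguments that the caller has already widened to int. The name sum_promoted is made up for this illustration and does not come from the question or the ABI document.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical helper: sums `count` narrow integer arguments passed via "...".
 * Default argument promotions have already widened each char to int,
 * so va_arg must fetch int (fetching char here would be undefined). */
int sum_promoted(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++) {
        int widened = va_arg(ap, int);  /* the caller passed a promoted int */
        total += (signed char)widened;  /* narrowing back recovers the original value */
    }
    va_end(ap);
    return total;
}
```

A call such as sum_promoted(3, (char)7, (char)8, (char)9) passes three full ints even though the source casts each value to char.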
On 32-bit machines, int was full register width, so there's the precedent for using full register width. And long was generally also that width, so some code was probably sloppy about mixing long and int, but would have broken on existing 32-bit systems if sloppy in the same way about long long. Making the arg-passing slots as wide as long in x86-64 System V lets some functions still happen to work that wouldn't have with int-sized slots.
This normally isn't a disaster in terms of wasting stack space or taking too many instructions (multiple push imm8 instead of one push imm32): Most functions get most of their args in registers (and can pack them if they want/need to spill them, e.g. if they have to call another function before using them). The C++ ABI does force "non-trivial" (constructor/destructor) class types to be passed in memory, though. (e.g. Why can a T* be passed in register, but a unique_ptr<T> cannot?) The code in your example add callee is terrible because you didn't enable optimization, but the code in your main looks normal.
See https://gitlab.com/x86-psABIs/x86-64-ABI for a copy of the current version of the spec. See also the x86 tag wiki for links to ABIs (and much more good stuff.) From page 23 of the ABI PDF:
Classification The size of each argument gets rounded up to eightbytes*.
(*footnote: Therefore the stack will always be eightbyte aligned).
That's not the most convincing justification. I think the hypothetical situation where narrow args would temporarily misalign RSP is that you'd still push narrow args one at a time, perhaps with pushw $1 for 2-byte args (or pushw $0x0201 for two 1-byte args?), or even emulating dword pushes (footnote 2) with sub $4, %rsp / mov %eax, (%rsp) or dec/mov for bytes. That would obviously be silly; you'd just move RSP once and use mov to store args. Or do like GCC -maccumulate-outgoing-args and only move RSP once for the whole function, reusing arg-passing space (footnote 3).
And you still have to end up with RSP % 16 == 0 before the call (pg 16: Stack Frame):
The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point.
If they'd designed it so different integer types had different widths on the stack, but 8-byte types were still always 8-byte aligned, there would be slightly more complicated rules about where the padding goes (and thus where the called function finds its args), depending on the types of current and previous args.
And va_arg for variadic functions like printf would be even more complex, needing to round up a pointer to the next alignment boundary for the type in case the previous arg ended at an insufficiently-aligned location.
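To make the difference concrete, here is a rough sketch that computes where a stack arg would live under fixed 8-byte slots versus under a hypothetical packed layout that follows struct-like alignment rules. The function names, and the simplification that each arg's alignment equals its size, are assumptions made for this illustration only; no real ABI defines the packed layout.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed-width slots (what x86-64 System V does): stack arg k sits at
 * k * 8 bytes, whatever the types of the other args are. */
size_t slot_offset(size_t index)
{
    return index * 8;
}

/* Round `offset` up to the next multiple of `align` (a power of two). */
size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical packed convention: each arg goes at the next offset aligned
 * for its own size, like struct members. sizes[] holds each arg's size,
 * assumed equal to its alignment (1, 2, 4 or 8). The result depends on the
 * sizes of every earlier arg, so caller and callee must agree on all of them. */
size_t packed_offset(const size_t *sizes, size_t index)
{
    size_t offset = 0;
    for (size_t i = 0; i < index; i++) {
        offset = align_up(offset, sizes[i]) + sizes[i];
    }
    return align_up(offset, sizes[index]);
}
```

With sizes {1, 1, 4} (a callee expecting two chars and an int) the third arg lands at offset 4, but a caller that promoted the chars to int would use {4, 4, 4} and put it at offset 8. With fixed slots both sides compute 16.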
Footnote 1: char/short/int compatibility
The low byte of an int8_t or uint8_t sign- or zero-extended to int is still the same int8_t or uint8_t object-representation, for 2's complement or unsigned.
So the caller extending it because of default arg promotions doesn't break anything as long as it's still passed in one fixed-width slot. (ISO C doesn't define the behaviour but I'm sure in practice it wasn't rare for old codebases to depend on this.) This isn't true for float/double, but lots of functions only use double if they use FP at all.
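The integer half of that claim can be checked with a small sketch like the following. The helper names are invented for this example, and it assumes a 2's complement int, which holds on x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Object representation (the raw byte) of an int8_t. */
uint8_t raw_byte(int8_t value)
{
    uint8_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Low 8 bits of the int that default argument promotion produces.
 * On a little-endian machine this is the byte a callee would load
 * from the start of the arg slot. */
uint8_t low_byte_after_promotion(int8_t value)
{
    int promoted = value;                /* sign-extends: what the caller passes */
    return (uint8_t)(unsigned)promoted;  /* keep only the low 8 bits */
}
```

For every int8_t value the two functions return the same byte, so a callee that reads only the low byte sees the value the caller started with.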
A longer explanation with more background follows; feel free to skip it.
With a function defined like int add(many args, char g, char h, char i) (or the K&R equivalent which doesn't act as a prototype), the callee knows it's only looking for narrow args. But callers that don't have a prototype (perhaps only a declaration like int add();, or worse an implicit declaration) will follow C's default argument promotions so (char)1 implicitly becomes (int)(char)1. (C11 N1570 6.5.2.2 / 6).
Such a caller has to assume the callee was written to take an int arg at that position, so it definitely needs to pass a full int.
ISO C doesn't define the behaviour when the callee expects char but the caller (after default argument promotions) passes int. (That's why it's putchar(int) for example; the stdio package predates prototypes and even the preprocessor.) However, most calling conventions do define the behaviour for narrow integer types, at least when -flto doesn't let the compiler see you breaking the rules and decide this code path must be unreachable or something.
A float arg won't work for functions with unprototyped callers: they'll promote it to double. The low 32 bits of a double interpreted as an IEEE binary32 float bit-pattern don't represent the same number. (Often 0.0 for round numbers like 3.0.) So functions with unprototyped callers can't take a float, even with a slot width of 8 bytes so later args are in the right position. (Same for variadic args, which is why printf has no conversion for float: any caller will promote it to double.)
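A small sketch of that float/double mismatch, assuming IEEE binary64 double and binary32 float as on x86-64. The function name is hypothetical; it only mimics what a callee would see if it loaded 4 bytes from the low half of a slot holding a promoted double.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterprets the low 32 bits of a double's bit pattern as a float.
 * Assumes IEEE binary64 double and binary32 float. */
float low_half_as_float(double promoted)
{
    uint64_t bits64;
    uint32_t low32;
    float result;

    memcpy(&bits64, &promoted, sizeof bits64);
    low32 = (uint32_t)bits64;               /* discard the high 32 bits */
    memcpy(&result, &low32, sizeof result);
    return result;
}
```

The bit pattern of 3.0 as a double is 0x4008000000000000, so its low half is all zeros and reads back as 0.0f rather than 3.0f.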
But 2's complement and unsigned integer types sign/zero extend just by tacking on copies of the sign bit, or zeros respectively. This doesn't change the bits of the narrower integer, so those low bits are still a valid signed char or unsigned short. A callee that was expecting int8_t will still read the right value whether the caller wrote only a byte or a whole dword with extra copies of the sign bit above it.
Note that args in x86-64 System V can contain garbage outside the actual C type width, in a register or the stack slot. Same for return values. (This is why compilers sometimes do movsxd or mov edx, edi for int or unsigned args before using them to index an array. Actually I'm being over-optimistic: usually they stupidly do mov edi,edi which defeats mov-elimination. So size_t args can be better in functions that don't inline.)
Clang relies on an undocumented extension which GCC also implements for the caller side but doesn't rely on, where narrow args are extended to 32-bit, as if by default argument promotion, in both registers and stack memory, for functions with prototypes. So even callers with prototypes should extend args to 32-bit when calling into clang-compiled code. This is often something you want to do anyway, e.g. movzx for byte loads from memory.
The key point here is that a call to int add(a,b) char a,b; {return a+b;} (K&R-style definition) with no prototype anywhere will result in the caller writing 4-byte args (in 8-byte slots in x86-64 System V) and the callee reading 1-byte args from the correct places (the least-significant bytes of what the caller wrote), so it all Just Works. That holds for registers and for later args on the stack.
But if add looked for its two char args in adjacent bytes, that would only work for callers that saw a prototype and placed their args accordingly, different from int args.
On the other hand, long foo(a, b) long a,b; {return a+b;} with a caller that does foo(1, (int)x) (without seeing a prototype) is not guaranteed to work. (Not even in practice on x86-64, even with the unofficial extension Clang relies on: that only extends to 32-bit.) If the caller uses mov to store args to memory (like gcc -maccumulate-outgoing-args), it's probably only going to write a dword for each, leaving garbage in the high bytes that the callee is going to read. (For register args, often those will be zero-extended to 64-bit from writing the 32-bit register (because that's how the x86-64 ISA works) so that already doesn't work for signed negative values, and you can construct cases where the compiler can leave garbage in the high 32 bits of RDI when calling a function that wants an int.) Other than K&R definitions, if headers only have long foo(); or nothing, then from other compilation units callers are in the same boat even if some other .c used long foo(long a, long b){...}.
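The following sketch models that failure mode in plain C: an 8-byte slot whose low half was written with a 32-bit int while the high half holds whatever was there before. The function name and the explicit high_garbage parameter are inventions for this illustration; real code would get the stale bits from earlier stack or register contents.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What a callee expecting a 64-bit long reads when the caller wrote only a
 * 32-bit int into the low half of the slot and the high half happens to
 * contain `high_garbage`. */
int64_t read_slot_as_long(int32_t written, uint32_t high_garbage)
{
    uint64_t slot = ((uint64_t)high_garbage << 32) | (uint32_t)written;
    int64_t result;

    memcpy(&result, &slot, sizeof result);  /* reinterpret all 64 bits as signed */
    return result;
}
```

With a zeroed high half, small positive values survive, but -1 reads back as 4294967295. Only a high half that happens to be a sign-extension gives the value the caller meant.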
Default argument promotion up to int saves the day for passing char or int to functions that wanted int or char even without prototypes, but not for functions that want args wider than int. (This includes long for x86-64 System V, but not for Windows x64 which chose an LLP64 ABI: only long long and pointers are 64-bit. But it includes size_t for both.)
Footnote 2: push
8-bit pushes are not encodable at all, in any mode. In 64-bit mode, only 16-bit (with a 0x66 prefix) or 64-bit (no prefix, or REX.W=1) are available.
The "description" and pseudocode part of Intel's manual entry are confusing on this, saying that REX can override the operand-size and having a clause for 32-bit operand-size with 64-bit address-size. But push dword is not encodable in 64-bit mode even with REX.W=0: the table at the top is correct, the "RSP := RSP – 4" in the pseudocode is unreachable. See How many bytes does the push instruction push onto the stack when I don't specify the operand size?.
Footnote 3: GCC -maccumulate-outgoing-args (using mov) was generally faster anyway when AMD64 was new, before Pentium M and later AMD had a stack engine to avoid uops that update RSP. But it has larger code-size in bytes, since push reg is a 1-byte instruction. And code-size was a factor for both L1i hit rate and fetch/decode bandwidth on early CPUs before uop caches.
x86-64 System V's actual design does make push reg and push imm8/imm32 fairly efficiently usable for functions with so many args that they need some on the stack (or C++ functions with non-POD args that must go in memory). Since you need the last push to leave RSP % 16 == 0, you might need a dummy push or something to reach the right starting point for however many args you're going to push.
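The alignment arithmetic behind that dummy push can be sketched as below. The function name and the idea of passing RSP in as a number are assumptions for illustration only; a compiler tracks this statically. It also assumes RSP is already 8-byte aligned and that every stack arg occupies one 8-byte slot.

```c
#include <assert.h>
#include <stdint.h>

/* Number of extra 8-byte pushes (0 or 1) needed before pushing `stack_args`
 * 8-byte arg slots so that RSP % 16 == 0 at the call instruction.
 * `rsp_before` is a hypothetical RSP value before any args are pushed,
 * assumed to be a multiple of 8. */
unsigned padding_pushes(uint64_t rsp_before, unsigned stack_args)
{
    uint64_t rsp_at_call = rsp_before - 8u * (uint64_t)stack_args;
    return (rsp_at_call % 16 == 0) ? 0u : 1u;
}
```

Starting from a 16-byte aligned RSP, an odd number of stack args needs one padding push and an even number needs none.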
(Unlike Windows x64 with shadow space below the stack args, where you normally don't use push. And apparently stack-unwind info doesn't want you to change RSP except in function prologue or epilogue, unless you're using RBP as a frame pointer to allow alloca.)
PS: This got way longer than I intended, and I had a hard time boiling it down. It could probably use a proof-read; let me know if there's any dangling sentence fragments or duplicated points. (There's a bunch of first-draft stuff in an HTML comment in case I come back to it.)
Upvotes: 11
Reputation: 16331
When pushing values onto the stack, the push must always be based on the word size of the system. If you're an old timer like me, that's 16 bits (though I do have some 12 bit word size systems!), but it really is system dependent.
Since you're talking about x86-64, you will be talking about 64-bit words. My understanding is that the word size is typically connected to the minimum number of bytes required to address any value in the RAM of the system. Since you have a 64-bit memory space, a 64-bit word (8 bytes, a "quad word" based on the original 16-bit word size) is required.
Upvotes: 3