chasep255
chasep255

Reputation: 12175

Why does gcc pass char type in 8 byte format to function assembly

To learn assembly I am viewing the assembly generated by GCC using the -S command for some simple c programs. I have an add function which accepts some ints and some char and adds them together. I am just wondering why the char parameters are pushed onto the stack as 8 bytes (pushq)? Why not just push a single byte?

    .file   "test.c"
    .text
    .globl  add
    .type   add, @function
add:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    %edx, -12(%rbp)
    movl    %ecx, -16(%rbp)
    movl    %r8d, -20(%rbp)
    movl    %r9d, -24(%rbp)
    movl    16(%rbp), %ecx
    movl    24(%rbp), %edx
    movl    32(%rbp), %eax
    movb    %cl, -28(%rbp)
    movb    %dl, -32(%rbp)
    movb    %al, -36(%rbp)
    movl    -4(%rbp), %edx
    movl    -8(%rbp), %eax
    addl    %eax, %edx
    movl    -12(%rbp), %eax
    addl    %eax, %edx
    movl    -16(%rbp), %eax
    addl    %eax, %edx
    movl    -20(%rbp), %eax
    addl    %eax, %edx
    movl    -24(%rbp), %eax
    addl    %eax, %edx
    movsbl  -28(%rbp), %eax
    addl    %eax, %edx
    movsbl  -32(%rbp), %eax
    addl    %eax, %edx
    movsbl  -36(%rbp), %eax
    addl    %edx, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   add, .-add
    .globl  main
    .type   main, @function
main:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    pushq   $9
    pushq   $8
    pushq   $7
    movl    $6, %r9d
    movl    $5, %r8d
    movl    $4, %ecx
    movl    $3, %edx
    movl    $2, %esi
    movl    $1, %edi
    call    add
    addq    $24, %rsp
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE1:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 4.9.2-10ubuntu13) 4.9.2"
    .section    .note.GNU-stack,"",@progbits
#include <stdio.h>

int add(int a, int b, int c, int d, int e, int f, char g, char h, char i)
{
    return a + b + c + d + e + f + g + h + i;
}

int main()
{
    return add(1, 2, 3, 4, 5, 6, 7, 8, 9);
}

Upvotes: 5

Views: 847

Answers (2)

Peter Cordes
Peter Cordes

Reputation: 364068

GCC is following the x86-64 System V ABI, which was designed to use stack slots the same width as full registers, like earlier 16 and 32-bit x86 calling conventions, and like typical ABIs for other machines.

This has a few advantages, like making it possible to use push fairly simply for stack args, even with a mix of constants and variables. And being more robust in some cases when code is sloppy about arg types for variadic (like printf) or unprototyped functions (e.g. implicit function declarations, or K&R-style declarations like int add(); without an arg list, either of which were maybe still a consideration in 2000 when the x86-64 SysV ABI was designed publicly on a mailing list whose archives give some insight into stuff like the choice of those six arg-passing registers.) Sloppy C and C++ was more common than hopefully now.

x86-64 System V is an LP64 ABI (Long and Pointers are 64-bit), so code that passes a long to a function expecting an int (variadic like printf("%d %d", sizeof(foo), my_long), or no prototype) would break later args in a hypothetical calling-convention where 64-bit long took more space than 32-bit int.

(Such a design would be possible, with stack args laid out following the struct rules above RSP, so each is aligned by its alignof(). Valid ISO C programs without UB would still work. POSIX's numbered printf stuff like printf(%2$Lf %1$d\n", 1, (long double)2.2) at first looks like it needs more functionality, but each arg must be mentioned by the format string so a possible implementation is to just retrieve or index them all in sequence. Even x86-64 System V doesn't make an array of fixed-width args that you can easily index.)

An implicit declaration of your add (e.g. if main was in a separate file without a prototype) would also break in a narrow-arg-packing convention even if you passed the last 3 args as (char)7.
C default argument promotion for variadic and unprototyped functions will expand char and other narrow integer types to int, which actually still works on 2's complement machines like x86 even if the caller is expecting char1, if the calling convention pads narrow args to at least int width. (ISO C doesn't define the behaviour, but unless -flto sees the mismatch, the calling convention effectively does define the behaviour.)

On 32-bit machines, int was full register width, so there's the precedent for using full register width. And long was generally also that width, so some code was probably sloppy about mixing long and int, but would have broken on existing 32-bit systems if sloppy in the same way about long long. Making the arg-passing slots as wide as long in x86-64 System V lets some functions still happen to work that wouldn't have with int-sized slots.

This normally isn't a disaster in terms of wasting stack space or taking too many instructions (multiple push imm8 instead of one one push imm32): Most functions get most of their args in registers (and can pack them if they want/need to spill them, e.g. if they have to call another function before using them). The C++ ABI does force "non-trivial" (constructor/destructor) class types to be passed in memory, though. (e.g. Why can a T* be passed in register, but a unique_ptr<T> cannot?) The code in your example add callee is terrible because you didn't enable optimization, but the code in your main looks normal.


x86-64 System V details and stated motivation

See https://gitlab.com/x86-psABIs/x86-64-ABI for a copy of the current version of the spec. See also the tag wiki for links to ABIs (and much more good stuff.) From page 23 of the ABI PDF:

Classification The size of each argument gets rounded up to eightbytes*.
(*footnote: Therefore the stack will always be eightbyte aligned).

That's not the most convincing justification. I think the hypothetical situation where narrow args would temporarily misalign RSP is that you'd still push narrow args one at a time, perhaps with pushw $1 for 2-byte args (or pushw $0x0201 for two 1-byte args?), or even emulating dword pushes2 with sub $4, %rsp / mov %eax, (%rsp) or dec/mov for bytes. That would obviously be silly, you'd just move RSP once and use mov to store args. Or like GCC -maccumulate-outgoing-args to only move RSP once for the whole function, reusing arg-passing space3.

And you still have to end up with RSP % 16 == 0 before the call - (pg 16: Stack Frame):

The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point.

If they'd designed it so different integer types had different widths on the stack, but 8-byte types were still always 8-byte aligned, there would be slightly more complicated rules about where the padding goes, (and thus where the called function finds its args) depending on the types of current and previous args.

And va_arg for variadic functions like printf would be even more complex, needing to round up a pointer to the next alignment boundary for the type in case the previous arg ended at an insufficiently-aligned location.


Footnote 1 - char/short/int compatibility

The low byte of an int8_t or uint8_t sign or zero-extended to int is still the same int8_t or uint8_t object-representation, for 2's complement or unsigned.

So the caller extending it because of default arg promotions doesn't break anything as long as it's still passed in one fixed-width slot. (ISO C doesn't define the behaviour but I'm sure in practice it wasn't rare for old codebases to depend on this.) This isn't true for float/double, but lots of functions only use double if they use FP at all.

A longer explanation with more background follows; feel free to skip it.

With a function defined like int add(many args, char g, char h, char i) (or the K&R equivalent which doesn't act as a prototype), the callee knows it's only looking for narrow args. But callers that don't have a prototype (perhaps only a declaration like int add();, or worse an implicit declaration) will follow C's default argument promotions so (char)1 implicitly becomes (int)(char)1. (C11 N1570 6.5.2.2 / 6).

Such a caller has to assume the callee was written to take an int arg at that position, so it definitely needs to pass a full int.

ISO C doesn't define the behaviour when the callee expects char but the caller (after default argument promotions) passes int. (That's why it's putchar(int) for example; the stdio package predates prototypes and even the preprocessor.) However, most calling conventions do define the behaviour for narrow integer types, at least when -flto doesn't let the compiler see you breaking the rules and decide this code path must be unreachable or something.

A float arg won't work for functions with unprototyped callers: they'll promote it to double. The low 32 bits of a double interpreted as an IEEE binary32 float bit-pattern don't represent the same number. (Often 0.0 for round numbers like 3.0). So functions with unprototyped callers can't take a float, even with a slot width of 8 bytes so later args are in the right position. (Same for variadic args, which is why printf has no conversion for float, because any caller will promote it to double.)

But 2's complement and unsigned integer types sign/zero extend just by tacking on copies of the sign bit, or zeros respectively. This doesn't change the bits of the narrower integer, so those low bits are still a valid signed char or unsigned short. A caller that was expecting int8_t will still read the right value whether the caller wrote only a byte or a whole dword with extra copies of the sign bit above it.

Note that args in x86-64 System V can contain garbage outside the actual C type width, in a register or the stack slot. Same for return values. (This is why compilers sometimes do movsxd or mov edx, edi for int or unsigned args before using them to index an array. Actually I'm being over-optimistic: usually they stupidly do mov edi,edi which defeats mov-elimination. So size_t args can be better in functions that don't inline.)
Clang relies on an undocumented extension which GCC also implements for the caller side but doesn't rely on, where narrow args are extended to 32-bit, as if by default argument promotion, in both registers and stack memory, for functions with prototypes. So even callers with prototypes should extend args to 32-bit when calling into clang-compiled code. This is often something you want to do anyway, e.g. movzx for byte loads from memory.

The key point here is that a call to int add(a,b) : char a,b; {return a+b;} (K&R style declaration) with no prototype anywhere will result in the caller writing 4-byte args (in 8-byte slots in x86-64 System V) and the caller reading 1-byte args from the correct places (the least-significant bytes of what the caller wrote), so it all Just Works. For registers or for later args on the stack.

But if add looked for its two char args in adjacent bytes, that would only work for callers that saw a prototype and placed their args accordingly, different from int args.

On the other hand, long foo(a, b) long a,b; {return a+b} with a caller that does foo(1, (int)x) (without seeing a prototype) is not guaranteed to work without a prototype. (Even in practice on x86-64, even with the unofficial extension Clang relies on: that's only to 32-bit). If the caller uses mov to store args to memory (like gcc -maccumulate-outgoing-args), it's probably only going to write a dword for each, leaving the garbage in the high bytes that the caller is going to read. (For register args, often those will be zero-extended to 64-bit from writing the 32-bit register (because that's how the x86-64 ISA works) so that already doesn't work for signed negative values, and you can construct cases where the compiler can leave garbage in the high 32 bits of RDI when calling a function that wants an int.) Other than K&R definitions, if headers only have long foo(); or nothing, then from other compilation units callers are in the same boat even if some other .c used long foo(long a, long b){...}.

Default argument promotion up to int saves the day for passing char or int to functions that wanted int or char even without prototypes, but not for functions that want args wider than int. (This includes long for x86-64 System V, but not for Windows x64 which chose an LLP64 ABI: only long long and pointers are 64-bit. But includes size_t for both.)

Footnote 2 - operand sizes for push

8-bit pushes are not encodable at all, in any mode. In 64-bit mode, only 16-bit (with a 0x66 prefix), or 64-bit (no prefix, or REX.W=1) are available.

The "description" and pseudocode part of Intel's manual entry are confusing on this, saying that REX can override the operand-size and having a clause for 32-bit operand-size with 64-bit address-size. But push dword is not encodeable in 64-bit mode even with REX.W=0: the table at the top is correct, the "RSP := RSP – 4" in the pseudocode is unreachable. See How many bytes does the push instruction push onto the stack when I don't specify the operand size?.

Footnote 3: push vs. mov for storing args:

GCC -maccumulate-outgoing-args (using mov) was generally faster anyway when AMD64 was new, before Pentium M and later AMD had a stack engine to avoid uops that update RSP. But larger code-size in bytes, since push reg is a 1-byte instruction. And code-size both for L1i hit rate and fetch/decode bandwidth was a factor on early CPUs before uop caches.

x86-64 System V's actual design does make push reg and push imm8/imm32 fairly efficiently usable for functions with so many args that they need some on the stack (or C++ functions with non-POD args that must go in memory). Since you need the last push to leave RSP % 16 == 0, you might need a dummy push or something to reach the right starting point for however many args you're going to push.

(Unlike Windows x64 with shadow space below the stack args, where you normally don't use push. And apparently stack-unwind info doesn't want you to change RSP except in function prologue or epilogue, unless you're using RBP as a frame pointer to allow alloca.)


PS: This got way longer than I intended, and I had a hard time boiling it down. It could probably use a proof-read; let me know if there's any dangling sentence fragments or duplicated points. (There's a bunch of first-draft stuff in an HTML comment in case I come back to it.)

Other Q&As on the subject:

Upvotes: 11

David Hoelzer
David Hoelzer

Reputation: 16331

When pushing values onto the stack, the push must always be based on the word size of the system. If you're an old timer like me, that's 16 bits (though I do have some 12 bit word size systems!), but it really is system dependent.

Since you're talking about X86_64, you will be talking about 64 bit words. My understanding is that the word size is typically connected to the minimum number of bytes required to address any value on the RAM of the system. Since you have a 64 bit memory space, a 64 bit (or 8 bytes, a "quad word" based on the original 16 bit word size) is required.

Upvotes: 3

Related Questions