Reputation: 718
I am trying to learn how assembly works at an elementary level, so I have been playing with the -S output of gcc compilations. I wrote a simple program that defines two bytes and returns their sum. The entire program follows:
int main(void) {
    char A = 5;
    char B = 10;
    return A + B;
}
When I compile this with no optimizations using:
gcc -O0 -S -c test.c
I get test.s that looks like the following:
.file "test.c"
.def ___main; .scl 2; .type 32; .endef
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $16, %esp
call ___main
movb $5, 15(%esp)
movb $10, 14(%esp)
movsbl 15(%esp), %edx
movsbl 14(%esp), %eax
addl %edx, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE0:
.ident "GCC: (GNU) 4.9.2"
Now, recognizing that this program can very easily be simplified to just return a constant (15) I have been able to reduce the assembly by hand to perform the same function using this code:
.global _main
_main:
movl $15, %eax
ret
This appears to me to be the least amount of code possible (but I realize I could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?
Why is the initial output of GCC so much more verbose? What do the lines spanning from .cfi_startproc to call __main even do? What does call __main do? I cannot figure out what the two subtraction operations are for.
Even with optimizations in GCC set to -O3 I get this:
.file "test.c"
.def ___main; .scl 2; .type 32; .endef
.section .text.unlikely,"x"
LCOLDB0:
.section .text.startup,"x"
LHOTB0:
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
call ___main
movl $15, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE0:
.section .text.unlikely,"x"
LCOLDE0:
.section .text.startup,"x"
LHOTE0:
.ident "GCC: (GNU) 4.9.2"
This seems to have removed a number of operations, but it still leaves all the lines leading up to call __main, which seem unnecessary. What are all the .cfi_XXX lines for? Why are so many labels added? What do .section, .ident, .def, .p2align, etc. do?
I understand that many of the labels and symbols are included for debugging, but shouldn't these be stripped or omitted if I am not compiling with -g enabled?
UPDATE
To clarify, by saying
This appears to me to be the least amount of code possible (but I realize I could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?
I am not suggesting that I am trying to, or have achieved, an optimized version of this program. I realize the program is useless and trivial. I am just using it as a tool to learn assembly and how the compiler works.
The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.
Upvotes: 10
Views: 1947
Reputation: 363980
Thank you, Kin3TiX, for asking an asm-newbie question that wasn't just a code-dump of some nasty code with no comments, and a really simple problem. :)
As a way to get your feet wet with asm, I'd suggest working with functions OTHER than main, e.g. just a function that takes two integer args and adds them. Then the compiler can't optimize it away. You can still call it with constants as args, and if it's in a different file from main, it won't get inlined, so you can even single-step through it.
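A minimal sketch of that suggestion, assuming a hypothetical file name add.c and function name add (neither comes from the question). Compiled on its own, for example with gcc -O3 -S add.c, the function cannot be folded down to a constant, because its arguments are not known when this file is compiled:

```c
/* add.c (hypothetical file name): keep this in its own translation unit
 * so that a call from main() in another file is not inlined
 * (assuming link-time optimization is not enabled). */
int add(int a, int b)
{
    /* a and b are unknown at compile time here, so the compiler has to
     * emit real code that adds them instead of a precomputed constant. */
    return a + b;
}
```

Calling add(5, 10) from main in a separate file should still return 15, but the asm for add itself stays short and is free of the main-specific setup discussed below.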
There's some benefit to understanding what's going on at the asm level when you compile main, but other than embedded systems, you're only ever going to write optimized inner loops in asm. IMO, there's little point using asm if you aren't going to optimize the hell out of it. Otherwise you probably won't beat compiler output from source which is much easier to read.
Other tips for understanding compiler output: compile with gcc -S -fno-stack-check -fverbose-asm. The comments after each instruction are often nice reminders of what that load was for. Pretty soon it degenerates into a mess of temporaries with names like D.2983, but something like
movq 8(%rdi), %rcx # a_1(D)->elements, a_1(D)->elements
will save you a round-trip to the ABI reference to see which function arg arrives in %rdi, and which struct member is at offset 8.
See also How to remove "noise" from GCC/clang assembly output?
What do the lines spanning from .cfi_startproc to call __main even do?
_main:
LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
The .cfi stuff is stack-unwind info that debuggers (and C++ exception handling) use to unwind the stack. It won't be there if you look at asm from objdump -d output instead of gcc -S, or you can use -fno-asynchronous-unwind-tables.
The stuff with pushing %ebp and then setting it to the value of the stack pointer on function entry sets up what's called a "stack frame". This is why %ebp is called the base pointer. These insns won't be there if you compile with -fomit-frame-pointer, which gives code an extra register to work with. That's on by default at -O2. (This is huge for 32bit x86, since that takes you from 6 to 7 usable regs. %esp is still tied up being the stack pointer; stashing it temporarily in an xmm or mmx reg and then using it as another GP reg is possible in theory, but compilers will never do that, and it makes async stuff like POSIX signals or Windows SEH unusable, as well as making debugging harder.)
The leave instruction before the ret is also part of this stack frame stuff.
Frame pointers are mostly historical baggage, but do make offsets into the stack frame consistent. With debug symbols, you can backtrace the call stack just fine even with -fomit-frame-pointer, and it's the default for amd64. (The amd64 ABI has alignment requirements for the stack, and is a LOT better in other ways, too, e.g. it passes args in regs instead of on the stack.)
andl $-16, %esp
subl $16, %esp
The and aligns the stack to a 16-byte boundary, regardless of what it was before. The sub reserves 16 bytes on the stack for this function. (Notice how it's missing from the optimized version, because it optimizes away any need for memory storage of any variables.)
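The arithmetic behind those two instructions can be sketched in C. The function names and the example address are made up for illustration; the point is only that AND-ing with -16 clears the low four bits, which rounds the value down to a multiple of 16:

```c
#include <stdint.h>

/* Round an address down to a multiple of 16, as `andl $-16, %esp` does.
 * In two's complement, -16 has its low four bits clear and every other
 * bit set, which is the same mask as ~15. */
uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}

/* Mimic the two prologue instructions: align, then reserve 16 bytes. */
uintptr_t prologue_esp(uintptr_t esp)
{
    esp = align_down_16(esp);  /* andl $-16, %esp */
    esp -= 16;                 /* subl $16, %esp  */
    return esp;
}
```

For a hypothetical incoming stack pointer of 0x28FF2C, the and gives 0x28FF20 and the sub gives 0x28FF10. The stack grows downward, so rounding down never touches memory the caller is using.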
call ___main
__main (asm name = ___main) is part of Cygwin: it calls constructor / init functions for shared libraries (including libc). On GNU/Linux, this is handled by _start (before main is reached) and even dynamic-linker hooks that let libc initialize itself before the executable's own _start is even reached. I've read that dynamic-linker hooks (or _start from a static executable) instead of code in main would be possible under Cygwin, but they simply choose not to do it that way.
(This old mailing list message indicates __main is for constructors, but that main shouldn't have to call it on platforms that support getting the startup code to call it.)
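To get a feel for the kind of init function that startup code runs, here is a small sketch using the constructor attribute, which is a GCC/Clang extension and not standard C. The names initialized, init_before_main and was_initialized are invented for this example. Which mechanism actually invokes the constructor (a call to __main inside main, or startup code before main) depends on the platform, as described above:

```c
/* Set by the constructor below; stays 0 if it never ran. */
static int initialized = 0;

/* GCC/Clang extension (an assumption of this sketch, not standard C):
 * a function marked as a constructor is run as part of program
 * initialization, before the body of main() executes. */
__attribute__((constructor))
static void init_before_main(void)
{
    initialized = 1;
}

/* Returns 1 if the constructor has already run, 0 otherwise. */
int was_initialized(void)
{
    return initialized;
}
```

If main calls was_initialized() as its first statement, it should see 1, because the constructor has run by then.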
movb $5, 15(%esp)
movb $10, 14(%esp)
movsbl 15(%esp), %edx
movsbl 14(%esp), %eax
addl %edx, %eax
leave
ret
Why is the initial output of GCC so much more verbose?
Without optimizations enabled, gcc maps C statements as literally as possible into asm. Doing anything else would take more compile time. Thus, movb is from the initializers for your two variables. The return value is computed by doing two loads (with sign extension, because we need to upconvert to int BEFORE the add, to match the semantics of the C code as written, as far as overflow is concerned).
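The promotion to int is visible from C as well. A short sketch (the function name sum_chars is made up here): because both operands are converted to int before the addition, a sum that would not fit in 8 bits still comes out right, which is what the two movsbl loads followed by a 32-bit addl implement.

```c
/* The type of (char + char) is int, not char: the operands are promoted
 * before the addition takes place. */
_Static_assert(sizeof((char)1 + (char)1) == sizeof(int),
               "char operands are promoted to int before +");

/* Mirrors `return A + B;` from the question, with the values passed in
 * as arguments so the compiler cannot fold them into a constant. */
int sum_chars(char a, char b)
{
    return a + b;  /* computed at int width */
}
```

For instance, sum_chars(100, 100) returns 200, even though 200 does not fit in a signed 8-bit char.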
I cannot figure what the two subtraction operations are for.
There is only one sub instruction. It reserves space on the stack for the function's variables, before the call to __main. Which other sub are you talking about?
What do .section, .ident, .def .p2align, etc. etc. do?
See the manual for the GNU assembler. Also available locally as info pages: run info gas.
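.ident and .def: .ident is gcc putting its stamp on the object file, so you can tell what compiler produced it. The .def ... .endef pairs record symbol information (storage class via .scl, type via .type) for COFF-style object files such as the ones Cygwin uses. Neither affects what the code does, so you can ignore them.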
.section: determines what section of the object file (ELF on Linux, PE/COFF on Cygwin) the bytes from all following instructions or data directives (e.g. .byte 0x00) go into, until the next .section assembler directive. The usual ones are .text (code: read-only, shareable), .data (initialized read/write data, private), and .bss (historically "block started by symbol": zero-initialized, doesn't take any space in the object file).
.p2align: Power of 2 Align. Pad with nop instructions until the desired alignment. On x86, .align 16 is the same as .p2align 4. Jump instructions are faster when the target is aligned, because instructions are fetched in chunks of 16B, and because an aligned target avoids crossing a page boundary or a cache-line boundary. (32B alignment is relevant when code is already in the uop cache of an Intel Sandybridge and later.) See Agner Fog's docs, for example.
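As a rough model of what the assembler computes for a directive like .p2align 4,,15, here is a sketch in C. The function names are invented for illustration. The third operand of .p2align is the maximum number of bytes to skip: per the gas manual, if more padding than that would be needed, the alignment is not done at all.

```c
#include <stdint.h>

/* Number of padding bytes needed to bring addr up to a 2^power boundary. */
uintptr_t p2align_padding(uintptr_t addr, unsigned power)
{
    uintptr_t size = (uintptr_t)1 << power;   /* e.g. 16 for power = 4 */
    uintptr_t mask = size - 1;
    return (size - (addr & mask)) & mask;     /* 0 if already aligned */
}

/* Address after `.p2align power,,max_skip`: if more than max_skip bytes
 * of padding would be needed, the address is left unchanged. */
uintptr_t p2align_apply(uintptr_t addr, unsigned power, uintptr_t max_skip)
{
    uintptr_t pad = p2align_padding(addr, power);
    return pad > max_skip ? addr : addr + pad;
}
```

With power = 4 the padding can never exceed 15 bytes, so .p2align 4,,15 always aligns to 16. A smaller limit such as 7 would skip the alignment whenever 8 or more bytes of padding were required.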
The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.
Put the code of interest in a function by itself. A lot of things are special about main.
You are correct that a mov-immediate and a ret are all that's needed to implement the function, but gcc apparently doesn't have shortcuts for recognizing trivial whole programs and omitting main's stack frame or the call to __main. >.<
Good question, though. As I said, just ignore all that crap and worry about just the small part you want to optimize.
Upvotes: 10
Reputation: 57764
The -o0 option directs the output to a file named 0. Maybe you meant the optimization level -O0 (capital O), which disables optimizations?
I don't understand why there would be a call to ___main unless this was produced for some emulated or hooked environment. When I compile with gcc -O0 -c -S t.c, I get:
.file "t.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movb $5, -2(%rbp)
movb $10, -1(%rbp)
movsbl -2(%rbp), %edx
movsbl -1(%rbp), %eax
leal (%rdx,%rax), %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-11)"
.section .note.GNU-stack,"",@progbits
Perhaps you were expecting a high level of optimization? This is what I get with gcc -O3 -c -S t.c:
.file "t.c"
.text
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
movl $15, %eax
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-11)"
.section .note.GNU-stack,"",@progbits
Except for the debug information, it is about as short as it can be. The same code is produced for gcc -O2 -c -S t.c and gcc -O1 -c -S t.c. That is, the slightest optimization evaluates all the constants at compile time.
Upvotes: 1
Reputation: 9805
The .cfi (call frame information) directives are used in gas (the GNU assembler) mainly for debugging. They allow the debugger to unwind the stack. To disable them, pass -fno-asynchronous-unwind-tables when you invoke the compilation driver.
If you want to play with the compiler in general, you can invoke the compilation driver with the options -o <filename.S> -S -masm=intel -fno-asynchronous-unwind-tables <filename.C>, or just use godbolt's interactive compiler.
Upvotes: 6
Reputation: 179392
First off, the CFI stuff is there for debugging purposes (and, in C++, exception handling). It tells the debugger what the stack frame looks like at each instruction, so that the debugger can reconstruct the state of the program's variables. Those don't result in executable statements, and will have zero effect on a program's runtime performance.
I don't know what the call to __main is doing there; my GCC doesn't do that. In fact, my GCC (4.9.2) gives me the following for gcc test.c -S -O1:
.section __TEXT,__text_startup,regular,pure_instructions
.globl _main
_main:
LFB0:
movl $15, %eax
ret
LFE0:
.section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
EH_frame1:
.set L$set$0,LECIE1-LSCIE1
.long L$set$0
LSCIE1:
.long 0
.byte 0x1
.ascii "zR\0"
.byte 0x1
.byte 0x78
.byte 0x10
.byte 0x1
.byte 0x10
.byte 0xc
.byte 0x7
.byte 0x8
.byte 0x90
.byte 0x1
.align 3
LECIE1:
LSFDE1:
.set L$set$1,LEFDE1-LASFDE1
.long L$set$1
LASFDE1:
.long LASFDE1-EH_frame1
.quad LFB0-.
.set L$set$2,LFE0-LFB0
.quad L$set$2
.byte 0
.align 3
LEFDE1:
.subsections_via_symbols
And would you look at that: _main is exactly the two-instruction sequence you expected. (The __eh_frame stuff is more debugging information in a different format.)
Upvotes: 1
Reputation: 26358
I think that part is just a fixed pattern that sets up a 16-byte aligned stack, and the CFI is related to exception frame handling.
Determining that those are not needed for a given main() is hard: it would be a global optimization, because main might call functions in other compilation units.
And it is probably not worthwhile spending the time to optimize this trivial and fairly useless case.
If you feel otherwise, you can always start working on such an optimization and submit it to gcc.
Upvotes: 0