Reputation: 718

GCC Assembly Optimizations - Why are these equivalent?

I am trying to learn how assembly works at an elementary level and so I have been playing with the -S output of gcc compilations. I wrote a simple program that defines two bytes and returns their sum. The entire program follows:

int main(void) {
  char A = 5;
  char B = 10;
  return A + B;
}

When I compile this with no optimizations using:

gcc -O0 -S -c test.c

I get test.s that looks like the following:

    .file   "test.c"
    .def    ___main;    .scl    2;  .type   32; .endef
    .text
    .globl  _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    subl    $16, %esp
    call    ___main
    movb    $5, 15(%esp)
    movb    $10, 14(%esp)
    movsbl  15(%esp), %edx
    movsbl  14(%esp), %eax
    addl    %edx, %eax
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
LFE0:
    .ident  "GCC: (GNU) 4.9.2"

Now, recognizing that this program can very easily be simplified to just return a constant (15) I have been able to reduce the assembly by hand to perform the same function using this code:

.global _main
_main:
    movl    $15, %eax
    ret

This appears to me to be the least amount of code possible (but I realize could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?

Why is the initial output of GCC so much more verbose? What do the lines spanning from .cfi_startproc to call __main even do? What does call __main do? I cannot figure what the two subtraction operations are for.

Even with optimizations in GCC set to -O3 I get this:

    .file   "test.c"
    .def    ___main;    .scl    2;  .type   32; .endef
    .section    .text.unlikely,"x"
LCOLDB0:
    .section    .text.startup,"x"
LHOTB0:
    .p2align 4,,15
    .globl  _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    call    ___main
    movl    $15, %eax
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
LFE0:
    .section    .text.unlikely,"x"
LCOLDE0:
    .section    .text.startup,"x"
LHOTE0:
    .ident  "GCC: (GNU) 4.9.2"

This seems to have removed a number of operations, but it still leaves all the lines leading up to call __main, which seem unnecessary. What are all the .cfi_XXX lines for? Why are so many labels added? What do .section, .ident, .def, .p2align, etc. do?

I understand that many of the labels and symbols are included for debugging, but shouldn't these be stripped or omitted if I am not compiling with -g enabled?

UPDATE

To clarify, by saying

This appears to me to be the least amount of code possible (but I realize could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?

I am not suggesting that I am trying to, or have achieved, an optimized version of this program. I realize the program is useless and trivial. I am just using it as a tool to learn assembly and how the compiler works.

The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.

Upvotes: 10

Answers (5)

Peter Cordes

Reputation: 365832

Thank you, Kin3TiX, for asking an asm-newbie question that wasn't just a code-dump of some nasty code with no comments, and a really simple problem. :)

As a way to get your feet wet with ASM, I'd suggest working with functions OTHER than main. e.g. just a function that takes two integer args, and adds them. Then the compiler can't optimize it away. You can still call it with constants as args, and if it's in a different file from main, it won't get inlined, so you can even single-step through it.
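A minimal sketch of that suggestion, assuming a hypothetical file add.c that is compiled separately from main (the file name and the function name add are only for illustration):

```c
/* add.c: hypothetical stand-alone file, kept separate from main()
 * so the compiler cannot inline the call or fold the result
 * into a constant. */
int add(int a, int b)
{
    /* With non-constant args, the addition has to happen at run time. */
    return a + b;
}
```

Compiling this with something like gcc -O2 -S add.c should leave just the few instructions for the function itself, with none of the setup that is specific to main.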

There's some benefit to understanding what's going on at the asm level when you compile main, but other than embedded systems, you're only ever going to write optimized inner loops in asm. IMO, there's little point using asm if you aren't going to optimize the hell out of it. Otherwise you probably won't beat compiler output from source which is much easier to read.

Other tips for understanding compiler output: compile with gcc -S -fno-stack-check -fverbose-asm. The comments after each instruction are often nice reminders of what that load was for. Pretty soon it degenerates into a mess of temporaries with names like D.2983, but something like

    movq 8(%rdi), %rcx # a_1(D)->elements, a_1(D)->elements

will save you a round-trip to the ABI reference to see which function arg comes in %rdi, and which struct member is at offset 8.
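For illustration, here is a guess at the kind of C source that could produce a comment like that. The struct vec layout and the name get_elements are assumptions, not taken from any real code base:

```c
#include <stddef.h>

/* Hypothetical struct: on a typical x86-64 ABI, `len` occupies
 * bytes 0..7 and `elements` starts at offset 8. */
struct vec {
    size_t len;
    int *elements;
};

/* Under the System V amd64 calling convention the first pointer arg
 * arrives in %rdi, so this load can compile to something like
 * movq 8(%rdi), %rax. */
int *get_elements(const struct vec *a)
{
    return a->elements;
}
```

With -fverbose-asm, the load of a->elements is the kind of instruction that gets annotated with the member name.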

See also How to remove "noise" from GCC/clang assembly output?

What do the lines spanning from .cfi_startproc to call __main even do?

_main:
LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5

The .cfi stuff is stack-unwind info that lets debuggers (and C++ exception handling) unwind the stack. It won't be there if you look at asm from objdump -d output instead of gcc -S, or you can use -fno-asynchronous-unwind-tables.

The stuff with pushing %ebp and then setting it to the value of the stack pointer on function entry sets up what's called a "stack frame". This is why %ebp is called the base pointer. These insns won't be there if you compile with -fomit-frame-pointer, which gives code an extra register to work with. That's on by default at -O2. (This is huge for 32bit x86, since that takes you from 6 to 7 usable regs. %esp is still tied up being the stack pointer; stashing it temporarily in an xmm or mmx reg and then using it as another GP reg is possible in theory, but compilers will never do that and it makes async stuff like POSIX signals or Windows SEH unusable, as well as making debugging harder.)

The leave instruction before the ret is also part of this stack frame stuff.

Frame pointers are mostly historical baggage, but do make offsets into the stack frame consistent. With debug symbols, you can backtrace the call stack just fine even with -fomit-frame-pointer, and it's the default for amd64. (The amd64 ABI has alignment requirements for the stack, and is a LOT better in other ways, too, e.g. it passes args in regs instead of on the stack.)

    andl    $-16, %esp
    subl    $16, %esp

The and aligns the stack to a 16-byte boundary, regardless of what it was before. The sub reserves 16 bytes on the stack for this function. (Notice how it's missing from the optimized version, because it optimizes away any need for memory storage of any variables.)
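The arithmetic behind those two instructions can be sketched in C. The function names here are hypothetical, and a plain integer stands in for %esp:

```c
#include <stdint.h>

/* Round an address down to a multiple of 16, as `andl $-16, %esp` does.
 * In two's complement, -16 is ...11110000, so the AND clears the low 4 bits. */
uintptr_t align_down_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)15;
}

/* Hypothetical helper mirroring both prologue instructions:
 * align first, then reserve 16 bytes (the stack grows downward). */
uintptr_t aligned_frame(uintptr_t sp)
{
    return align_down_16(sp) - 16;
}
```

For example, a stack pointer of 0xFFFC is rounded down to 0xFFF0, and reserving 16 bytes then gives 0xFFE0.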

    call    ___main

__main (asm name = ___main) is part of cygwin: it calls constructor / init functions for shared libraries (including libc). On GNU/Linux, this is handled by _start (before main is reached) and even dynamic-linker hooks that let libc initialize itself before the executable's own _start is even reached. I've read that dynamic-linker hooks (or _start from a static executable) instead of code in main would be possible under Cygwin, but they simply choose not to do it that way.

(This old mailing list message indicates __main is for constructors, but that main shouldn't have to call it on platforms that support getting the startup code to call it.)

    movb    $5, 15(%esp)
    movb    $10, 14(%esp)
    movsbl  15(%esp), %edx
    movsbl  14(%esp), %eax
    addl    %edx, %eax
    leave
    ret

Why is the initial output of GCC so much more verbose?

Without optimizations enabled, gcc maps C statements as literally as possible into asm. Doing anything else would take more compile time. Thus, the movb instructions come from the initializers for your two variables. The return value is computed by doing two loads (with sign extension, because we need to upconvert to int BEFORE the add, to match the semantics of the C code as written, as far as overflow is concerned).
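A small sketch of why the widening has to come first. It uses signed char to make the sign extension explicit (plain char is signed on x86, which is what the question's code relies on), and add_bytes is a hypothetical name:

```c
/* Both operands are promoted to int before the addition, which is
 * what the two movsbl (sign-extending) loads implement. */
int add_bytes(signed char a, signed char b)
{
    /* The sum is computed in int, so 100 + 100 does not wrap
     * around the way an 8-bit add would. */
    return a + b;
}
```

Here add_bytes(100, 100) gives 200, even though 200 doesn't fit in a signed char, and add_bytes(-1, -1) gives -2 because each -1 is sign-extended rather than treated as 255.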

I cannot figure what the two subtraction operations are for.

There is only one sub instruction. It reserves space on the stack for the function's variables, before the call to __main. Which other sub are you talking about?

What do .section, .ident, .def, .p2align, etc. do?

See the manual for the GNU assembler. Also available locally as info pages: run info gas.

.ident: gcc putting its stamp on the object file, so you can tell what compiler produced it. .def (together with .scl, .type and .endef) attaches COFF symbol-table information, such as storage class and type, to a symbol. Not relevant, ignore these.

.section: determines what section of the object file (ELF on Linux, PE/COFF in your case) the bytes from all following instructions or data directives (e.g. .byte 0x00) go into, until the next .section assembler directive. Typically either code (read-only, shareable), data (initialized read/write data, private), or bss (historically "block started by symbol": zero-initialized, doesn't take any space in the object file).

.p2align: Power of 2 Align. Pad with nop instructions until the desired alignment. .align 16 is the same as .p2align 4. Jump instructions are faster when the target is aligned, because of instruction fetch in chunks of 16B, not crossing a page boundary, or just not crossing a cache-line boundary. (32B alignment is relevant when code is already in the uop cache of an Intel Sandybridge and later.) See Agner Fog's docs, for example.
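As a rough sketch of the padding arithmetic (p2align_padding is a hypothetical helper, not something the assembler exposes), the number of filler bytes for .p2align n at a given offset can be computed like this:

```c
#include <stddef.h>

/* Bytes of padding needed to bring `offset` up to the next multiple
 * of 2^n, which is what `.p2align n` asks the assembler for. */
size_t p2align_padding(size_t offset, unsigned n)
{
    size_t mask = ((size_t)1 << n) - 1;
    /* (0 - offset) mod 2^n is the distance to the next boundary. */
    return (0 - offset) & mask;
}
```

For .p2align 4, an offset of 5 needs 11 bytes of padding and an offset of 16 needs none. The third argument in .p2align 4,,15 is the maximum number of bytes to skip, so with 15 the 16-byte alignment is always applied.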

The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.

Put the code of interest in a function by itself. A lot of things are special about main.
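For instance, the body of the question's main could be moved into a function of its own. The name sum_constants is hypothetical:

```c
/* The question's code, moved out of main into a hypothetical
 * function named sum_constants. */
int sum_constants(void)
{
    char A = 5;
    char B = 10;
    return A + B;   /* the compiler can fold this to 15 when optimizing */
}
```

With optimization enabled, I'd expect gcc to compile this to little more than a mov of 15 into %eax followed by ret, without the stack alignment or the call to __main that main gets on Cygwin, though the exact output depends on target and version.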

You are correct that a mov-immediate and a ret are all that's needed to implement the function, but gcc apparently doesn't have shortcuts for recognizing trivial whole-programs and omitting main's stack frame or the call to __main. >.<

Good question, though. As I said, just ignore all that crap and worry about just the small part you want to optimize.

Upvotes: 10

wallyk

Reputation: 57794

The -o0 option directs the output to a file named 0. Maybe you meant the optimization level -O0 (with a capital O), which disables optimizations?

I don't understand why there would be a call to ___main unless this was produced for some emulated or hooked environment. When I compile with gcc -O0 -c -S t.c, I get:

        .file   "t.c"
        .text
.globl main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movb    $5, -2(%rbp)
        movb    $10, -1(%rbp)
        movsbl  -2(%rbp), %edx
        movsbl  -1(%rbp), %eax
        leal    (%rdx,%rax), %eax
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-11)"
        .section        .note.GNU-stack,"",@progbits

Perhaps you were expecting a high level of optimization? This is what I get with gcc -O3 -c -S t.c:

        .file   "t.c"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        movl    $15, %eax
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-11)"
        .section        .note.GNU-stack,"",@progbits

Except for the debug information, it is about as short as it can be. The same code is produced for gcc -O2 -c -S t.c and gcc -O1 -c -S t.c. That is, the slightest optimization evaluates all the constants at compile time.

Upvotes: 1

NlightNFotis

Reputation: 9803

.cfi (call frame information) directives are used in gas (Gnu ASsembler) mainly for debugging. They allow the debugger to unwind the stack. To disable them, you can use the following parameter when you invoke the compilation driver -fno-asynchronous-unwind-tables.

If you want to play with the compiler in general, you can use the following compilation driver invocation command -o <filename.S> -S -masm=intel -fno-asynchronous-unwind-tables <filename.C> or just use godbolt's interactive compiler

Upvotes: 6

nneonneo

Reputation: 179707

First off, the CFI stuff is there for debugging purposes (and, in C++, exception handling). It tells the debugger what the stack frame looks like at each instruction, so that the debugger can reconstruct the state of the program's variables. Those don't result in executable statements, and will have zero effect on a program's runtime performance.

I don't know what the call to __main is doing there - my GCC doesn't do that. In fact, my GCC (4.9.2) gives me the following for gcc test.c -S -O1:

    .section __TEXT,__text_startup,regular,pure_instructions
    .globl _main
_main:
LFB0:
    movl    $15, %eax
    ret
LFE0:
    .section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
EH_frame1:
    .set L$set$0,LECIE1-LSCIE1
    .long L$set$0
LSCIE1:
    .long   0
    .byte   0x1
    .ascii "zR\0"
    .byte   0x1
    .byte   0x78
    .byte   0x10
    .byte   0x1
    .byte   0x10
    .byte   0xc
    .byte   0x7
    .byte   0x8
    .byte   0x90
    .byte   0x1
    .align 3
LECIE1:
LSFDE1:
    .set L$set$1,LEFDE1-LASFDE1
    .long L$set$1
LASFDE1:
    .long   LASFDE1-EH_frame1
    .quad   LFB0-.
    .set L$set$2,LFE0-LFB0
    .quad L$set$2
    .byte   0
    .align 3
LEFDE1:
    .subsections_via_symbols

and would you look at that, _main is exactly the two-instruction sequence you expected. (The __eh_frame stuff is more debugging information in a different format).

Upvotes: 1

Marco van de Voort

Reputation: 26381

I think that part is just a fixed pattern that sets up a 16-byte aligned stack and the CFI is exception frame handling related.

Determining that those are not needed for any main() is hard since that is a global optimization because main might call functions in other compilation units.

And it is probably not worthwhile spending the time to optimize this trivial and fairly useless case.

If you feel otherwise, you can always start working on such an optimization and submit it to gcc.

Upvotes: 0
