Flip

Reputation: 962

Different function call assembly with -O0 and -O1 with GCC on ARM

I would like to call an assembly function from C. It is part of a basic example for calling conventions.

The function is basic:

int mult(int A, int B){
    return A*B;
}

According to the Procedure Call Standard for the ARM® Architecture the parameters A and B should be in registers r0 and r1 respectively for the function call. The return value should be in r0.

Essentially then I would expect the function to be:

EXPORT mult
mult MUL r0, r0, r1
     BX lr

Using Compiler Explorer with GCC 7.2.1 (none) and -O1 -mcpu=cortex-m4 -mabi=aapcs, I get the following:

mult:
    mul     r0, r1, r0
    bx      lr

This is what I expected. However, if I disable optimizations (-O0), I get the following nonsense:

mult:
    push    {r7}
    sub     sp, sp, #12
    add     r7, sp, #0
    str     r0, [r7, #4]
    str     r1, [r7]
    ldr     r3, [r7, #4]
    ldr     r2, [r7]
    mul     r3, r2, r3
    mov     r0, r3
    adds    r7, r7, #12
    mov     sp, r7
    pop     {r7}
    bx      lr

I think this means GCC is using r7 as a frame pointer and passing all of the parameters and return values via the stack, which is not according to the AAPCS.

Is this a bug with Compiler Explorer, GCC or have I missed something in the AAPCS? Why would -O0 have a fundamentally different calling convention than specified in the AAPCS document?

Upvotes: 0

Views: 281

Answers (3)

artless-noise-bye-due2AI

Reputation: 22420

In my opinion, this is not due to debugging. -O0 takes out the optimization passes. As a result, the compiler sees neither that everything fits in registers nor that you call no other functions. Hence it always sets up a stack frame, with r7 as the frame pointer in Thumb-2 (Cortex-M4).

If you write a much busier function, you will see a stack frame even at -O3. See why compiler writers try to get rid of them? They make the output harder to understand, and they also add a horrible amount of code. An optimizing compiler goes even further and would see that,

  mov r0, xx  # our call site, might also have to save r0-r3.
  mov r1, yy  # because mult might trash those.
  bl  mult
...
mult:
    mul     r0, r1, r0
    bx      lr

Can be replaced by,

mul  xx,yy,xx   # one instruction!

It is quite common for call overhead to be as large as the actual function body. Other features like a macro, an inline keyword or attribute, etc. can achieve similar effects. Compilers are really good at allocating registers and getting rid of mov instructions. Your brain (or at least mine) is better at mapping high-level problems to specific machine instructions, like clz, addc, etc. This is especially true if the higher-level language doesn't have a way to denote what you want to do (use a carry, etc).
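To make the macro and inline options concrete, here is a minimal sketch in C. The names MULT_MACRO, mult_inline and scale are made up for illustration and are not from the question. With optimization enabled, GCC will typically fold both forms into the caller, so no bl is emitted; at -O0 the static inline function is usually still called like any other.

```c
/* Hypothetical names, for illustration only. */

/* Macro form: the preprocessor pastes the expression into the caller. */
#define MULT_MACRO(a, b) ((a) * (b))

/* Inline form: a hint that the body may be expanded at the call site. */
static inline int mult_inline(int a, int b)
{
    return a * b;
}

/* Caller: with optimization on, both multiplies are normally emitted
 * directly here, with no branch-and-link to a separate function. */
int scale(int x, int y)
{
    return mult_inline(x, y) + MULT_MACRO(x, 2);
}
```

Comparing the output for scale at -O0 and -O1 in Compiler Explorer should show the difference in call overhead.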


Upvotes: 1

Don't bother analyzing machine code compiled in debug mode, because it follows very obscure sequences that allow step-by-step execution with breakpoints while keeping all global and local variables visible.

It is not only pointless but also confusing if what you want is to learn assembly.

Go for -O2 or even -O3 all the time.

Upvotes: 3

Flip

Reputation: 962

Thanks to Marc Glisse for pointing out the obvious.

What is happening is that GCC is

  1. storing r0 (A) and r1 (B) on the stack. Then;
  2. reading the variables back from the stack into r3 and r2. Then;
  3. performing the multiply and storing the result in r3. Then;
  4. moving the result from r3 into the return register r0.
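The same four steps can be imitated in C, as a rough sketch. The function name mult_like_O0 and the slot variables are hypothetical; volatile is used here only to force the stores and loads that -O0 produces on its own, so the pattern stays visible even with optimization turned on. The arguments still arrive in r0 and r1 and the result still leaves in r0.

```c
/* Hypothetical helper, not from the original post. */
int mult_like_O0(int A, int B)
{
    /* Step 1: spill the incoming arguments (r0, r1) to stack slots. */
    volatile int a_slot = A;
    volatile int b_slot = B;

    /* Steps 2 and 3: reload both slots into scratch registers and multiply. */
    int tmp = a_slot * b_slot;

    /* Step 4: the result is moved into the return register. */
    return tmp;
}
```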

This seems like it is actively trying to make things slower...

But it is still AAPCS.

My bad.

Thanks Marc

Edit:

As Jake 'Alquimista' LEE mentions, this might make sense for debugging: all of the function's values are available to the debugger on the stack.

Upvotes: 0
