Mike

Reputation: 331

Analysis of simple ARM assembly code

I have started studying ARM assembly language with the GNU toolchain and wrote a very simple example function in C:

#include <stdint.h>

uint32_t *a;
uint32_t *b;
uint32_t *c;

__attribute__((naked)) void f() {
    *a += *c;
    *b += *c;
}

I then ran this command in the terminal to see the assembly code:

arm-none-eabi-gcc -O1 -S -std=c99 example.c -o -

And this is the result:

    @ Function supports interworking.
    @ Naked Function: prologue and epilogue provided by programmer.
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    ldr r3, .L2
    ldr r2, .L2+4
    ldr r3, [r3]
    ldr r2, [r2] ; r2 <- &(*c)
    ldr ip, [r3]
    ldr r0, [r2] ; r0 <- *c
    ldr r1, .L2+8
    add r0, ip, r0
    str r0, [r3]
    ldr r3, [r1]
    ldr r2, [r2] ; why make the same thing 
    ldr r1, [r3]
    add r2, r1, r2
    str r2, [r3]
.L3:
    .align  2
.L2:
    .word   a
    .word   c
    .word   b
    .size   f, .-f
    .comm   c,4,4
    .comm   b,4,4
    .comm   a,4,4

My question is why the compiler loads the address of the pointer c twice. If I understand correctly, that is the line

ldr  r2, [r2] 

I can't find a good reason for the compiler to duplicate this code. Thanks in advance.

Upvotes: 2

Views: 3305

Answers (3)

Carl Norum

Reputation: 224904

If your pointers alias, the two dereferences are required. Think about what your algorithm does if you have a == c. If they can't alias, you need to add some restrict keywords. Here's an example that optimizes the way you expect:

#include <stdint.h>

void f(uint32_t * restrict a, uint32_t * restrict b, uint32_t * restrict c)
{
    *a += *c;
    *b += *c;
}

And assembly output (comments mine):

00000000 <f>:
   0:   e5922000    ldr r2, [r2]     // r2 = *c
   4:   e5903000    ldr r3, [r0]     // r3 = *a
   8:   e0833002    add r3, r3, r2   // r3 = r3 + r2 = *a + *c
   c:   e5803000    str r3, [r0]     // *a = r3 = *a + *c
  10:   e5910000    ldr r0, [r1]     // r0 = *b
  14:   e0800002    add r0, r0, r2   // r0 = r0 + r2 = *b + *c
  18:   e5810000    str r0, [r1]     // *b = r0 = *b + *c
  1c:   e12fff1e    bx  lr

Edit: Here is an example more like your original one, first without the restrict keywords and second with, in GCC's output format this time.

Example one (without restrict keywords) code:

#include <stdint.h>

__attribute__((naked))
void f(uint32_t *a, uint32_t *b, uint32_t *c)
{
    *a += *c;
    *b += *c;
}

Output:

f:
    ldr ip, [r0, #0]
    ldr r3, [r2, #0]
    add r3, ip, r3
    str r3, [r0, #0]
    ldr r0, [r1, #0]
    ldr r3, [r2, #0]
    add r3, r0, r3
    str r3, [r1, #0]

Example two (with restrict keywords) code:

#include <stdint.h>

__attribute__((naked))
void f(uint32_t * restrict a, uint32_t * restrict b, uint32_t * restrict c)
{
    *a += *c;
    *b += *c;
}

Output:

f:
    ldr r3, [r2, #0]
    ldr ip, [r1, #0]
    ldr r2, [r0, #0]
    add r2, r2, r3
    add r3, ip, r3
    str r2, [r0, #0]
    str r3, [r1, #0]

The second dereference of c is absent from the second program, which makes it one instruction shorter.
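To make the a == c case concrete, here is a minimal sketch. The helper names (f_reload, f_cached, b_when_a_aliases_c, check_aliasing) are mine and purely illustrative. f_reload has the same body as the function in the question; f_cached reads *c only once, which is roughly what the compiler is allowed to do once restrict promises there is no aliasing.

```c
#include <assert.h>
#include <stdint.h>

/* Same body as the question's function: *c is read for each statement. */
static void f_reload(uint32_t *a, uint32_t *b, uint32_t *c)
{
    *a += *c;
    *b += *c;
}

/* Hypothetical variant that reads *c once and reuses the value. */
static void f_cached(uint32_t *a, uint32_t *b, uint32_t *c)
{
    uint32_t v = *c;
    *a += v;
    *b += v;
}

/* Calls fn with a and c pointing at the same object; returns the final *b. */
static uint32_t b_when_a_aliases_c(void (*fn)(uint32_t *, uint32_t *, uint32_t *))
{
    uint32_t x = 1, y = 10;
    fn(&x, &y, &x);
    return y;
}

/* The two variants disagree as soon as a aliases c. */
static void check_aliasing(void)
{
    /* Reload: x becomes 1 + 1 = 2, then y becomes 10 + 2 = 12. */
    assert(b_when_a_aliases_c(f_reload) == 12);
    /* Cached: v stays 1, so y becomes 10 + 1 = 11. */
    assert(b_when_a_aliases_c(f_cached) == 11);
}
```

Calling check_aliasing() from a main of your own should pass without tripping either assert. Since the two versions give different answers when a == c, the compiler has to keep the second load unless it is told the pointers cannot alias.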

Upvotes: 6

old_timer

Reputation: 71526

The add destroys r0, so we lose the value of c and have to reload it:

ldr r2, .L2+4 ; get address of .data location of *c from .text
...
ldr r2, [r2] ; r2 = pointer to c
...
ldr r0, [r2] ; r0  = c
...
add r0, ip, r0 ; this destroys r0 it no longer holds the value of c
...
ldr r2, [r2] ; need the value of c again to add to b

It is interesting that different versions of gcc and/or different optimization levels choose a different mix of registers, yet produce the same sequence with the additional load. The main question here is why it did this:

add r0, ip, r0
str r0, [r3]

instead of

add ip, ip, r0
str ip, [r3]

and then not need to re-load c?

My guess is that this is a nuance of the peephole optimizer. A related question is why it starts working on **b before it has finished storing a. Had it not done that, it would have had yet another free register (no doubt another optimization).

Another interesting point is that at least one of my gcc compilers produces this:

00001000 <_start>:
    1000:   eaffffff    b   1004 <fun>

00001004 <fun>:
    1004:   e59f2034    ldr r2, [pc, #52]   ; 1040 <fun+0x3c>
    1008:   e59f3034    ldr r3, [pc, #52]   ; 1044 <fun+0x40>
    100c:   e5921000    ldr r1, [r2]
    1010:   e5932000    ldr r2, [r3]
    1014:   e591c000    ldr ip, [r1]
    1018:   e5920000    ldr r0, [r2]
    101c:   e59f3024    ldr r3, [pc, #36]   ; 1048 <fun+0x44>
    1020:   e08c0000    add r0, ip, r0
    1024:   e5933000    ldr r3, [r3]
    1028:   e5810000    str r0, [r1]
    102c:   e5922000    ldr r2, [r2]
    1030:   e5931000    ldr r1, [r3]
    1034:   e0812002    add r2, r1, r2
    1038:   e5832000    str r2, [r3]
    103c:   e12fff1e    bx  lr
    1040:   00009054    andeq   r9, r0, r4, asr r0
    1044:   00009050    andeq   r9, r0, r0, asr r0
    1048:   0000904c    andeq   r9, r0, ip, asr #32

Disassembly of section .bss:

0000904c <__bss_start>:
    904c:   00000000    andeq   r0, r0, r0

00009050 <c>:
    9050:   00000000    andeq   r0, r0, r0

00009054 <a>:
    9054:   00000000    andeq   r0, r0, r0

With or without naked you get the same thing. Why was gcc so desperate to use every disposable register rather than, for example, the stack? Note that in your compile it adds a and then stores it, whereas in mine it adds a, then loads *b, then stores a. Not only did it move the load of **b up in the sequence, it also moved the load of *b ahead of finishing the result of a.

So naked did not help here, other than removing the bx lr at the end of the function. What you can and should try is -fdump-rtl-all on the gcc command line (it makes a LOT of files). Walk through those files to see where gcc started and where it changed things; that may show what determines the output. If the cause is not in the compiler's guts, then the peephole optimizer in the backend rearranged things, and I am not sure what the command-line option is to dump that.

The bottom line is that over the long haul (tens of thousands, hundreds of thousands, millions of lines of code) the compiler/optimizer is going to outperform the human, but it is very easy to catch isolated portions of optimized code that can be hand-tuned to be a little bit "better", depending on your definition of better. Note that fewer instructions is not always better.

Upvotes: 1

auselen

Reputation: 28087

Consecutive execution of ldr rX, [rX] would mean double dereferencing of whatever rX is pointing to.

If I understood your question correctly, the first one, as you say, is:

ldr r2, [r2] ; r2 <- &(*c)

and then the second one becomes

ldr r2, [r2] ; r2 <- *(r2)
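In C terms, the chain of loads on r2 can be sketched roughly like this. It is only an illustration: load_through_c and the local names are hypothetical, and the comments map each step onto the listing in the question on the assumption that .L2+4 is the literal-pool slot holding the address of c.

```c
#include <assert.h>
#include <stdint.h>

uint32_t *c;  /* global pointer, as in the question */

/* Mirrors the loads on r2 step by step and returns the value r2 ends up holding. */
static uint32_t load_through_c(void)
{
    uint32_t **literal = &c;       /* ldr r2, .L2+4 : r2 = address of the variable c */
    uint32_t  *ptr     = *literal; /* ldr r2, [r2]  : r2 = c, the address c points to */

    assert(ptr != 0);              /* c must point somewhere before this is called */

    uint32_t   value   = *ptr;     /* ldr r2, [r2]  : r2 = *c, the pointed-to value */
    return value;
}
```

So each ldr r2, [r2] peels off one level of indirection: the first turns the address of c into c itself, the second turns c into *c.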

If that's not the question, then from the GCC docs (see bold part):

naked

This attribute is available on the ARM, AVR, MCORE, MSP430, NDS32, RL78, RX and SPU ports. It allows the compiler to construct the requisite function declaration, while allowing the body of the function to be assembly code. The specified function will not have prologue/epilogue sequences generated by the compiler. Only Basic asm statements can safely be included in naked functions (see Basic Asm). While using Extended asm or a mixture of Basic asm and “C” code may appear to work, they cannot be depended upon to work reliably and are not supported.

Upvotes: 0
