Reputation: 331
I have started studying ARM assembly language with the GNU toolchain and wrote a very simple example function in C:
#include <stdint.h>
uint32_t *a;
uint32_t *b;
uint32_t *c;
__attribute__((naked)) void f() {
*a += *c;
*b += *c;
}
I then ran this command in the terminal to see the generated assembly:
arm-none-eabi-gcc -O1 -S -std=c99 example.c -o -
This is the result:
@ Function supports interworking.
@ Naked Function: prologue and epilogue provided by programmer.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
ldr r3, .L2
ldr r2, .L2+4
ldr r3, [r3]
ldr r2, [r2] ; r2 <- &(*c)
ldr ip, [r3]
ldr r0, [r2] ; r0 <- *c
ldr r1, .L2+8
add r0, ip, r0
str r0, [r3]
ldr r3, [r1]
ldr r2, [r2] ; why make the same thing
ldr r1, [r3]
add r2, r1, r2
str r2, [r3]
.L3:
.align 2
.L2:
.word a
.word c
.word b
.size f, .-f
.comm c,4,4
.comm b,4,4
.comm a,4,4
My question is why the compiler loads through the pointer c twice. If I understand correctly, the second load is this line:
ldr r2, [r2]
I can't find a good reason for the compiler to repeat this instruction. Thanks in advance.
Upvotes: 2
Views: 3305
Reputation: 224904
If your pointers alias, the two dereferences are required. Think about what your algorithm does if you have a == c: the store to *a also changes *c, so the second statement has to read *c again. If they can't alias, you need to add some restrict keywords. Here's an example that optimizes the way you expect:
#include <stdint.h>
void f(uint32_t * restrict a, uint32_t * restrict b, uint32_t * restrict c)
{
*a += *c;
*b += *c;
}
And assembly output (comments mine):
00000000 <f>:
0: e5922000 ldr r2, [r2] // r2 = *c
4: e5903000 ldr r3, [r0] // r3 = *a
8: e0833002 add r3, r3, r2 // r3 = r3 + r2 = *a + *c
c: e5803000 str r3, [r0] // *a = r3 = *a + *c
10: e5910000 ldr r0, [r1] // r0 = *b
14: e0800002 add r0, r0, r2 // r0 = r0 + r2 = *b + *c
18: e5810000 str r0, [r1] // *b = r0 = *b + *c
1c: e12fff1e bx lr
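To make the aliasing case concrete, here is a minimal sketch in plain C of what happens when a and c point at the same object. The names add_twice, aliased_result and distinct_result are hypothetical and used only for illustration; the function body is the one from the question, with the pointers passed as parameters and no restrict.

```c
#include <stdint.h>

/* Same two statements as the question's f(); no restrict, so the
   compiler must assume a, b and c may point at the same object. */
void add_twice(uint32_t *a, uint32_t *b, uint32_t *c)
{
    *a += *c;   /* if a == c, this store also changes *c */
    *b += *c;   /* so *c has to be read again here */
}

/* Hypothetical helper: a and c alias (both point at x). */
uint32_t aliased_result(void)
{
    uint32_t x = 1, y = 10;
    add_twice(&x, &y, &x);
    return y;   /* x became 2 first, so y = 10 + 2 */
}

/* Hypothetical helper: a, b and c are three distinct objects. */
uint32_t distinct_result(void)
{
    uint32_t x = 1, y = 10, z = 1;
    add_twice(&x, &y, &z);
    return y;   /* z is still 1, so y = 10 + 1 */
}
```

The two helpers start from the same values but return different results, which is why the compiler cannot reuse the first load of *c unless it is told the pointers do not alias.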
Edit: Here is an example more like your original one, first without the restrict keywords and second with, in GCC's output format this time.
Example one (without restrict keywords) code:
#include <stdint.h>
__attribute__((naked))
void f(uint32_t *a, uint32_t *b, uint32_t *c)
{
*a += *c;
*b += *c;
}
Output:
f:
ldr ip, [r0, #0]
ldr r3, [r2, #0]
add r3, ip, r3
str r3, [r0, #0]
ldr r0, [r1, #0]
ldr r3, [r2, #0]
add r3, r0, r3
str r3, [r1, #0]
Example two (with restrict keywords) code:
#include <stdint.h>
__attribute__((naked))
void f(uint32_t * restrict a, uint32_t * restrict b, uint32_t * restrict c)
{
*a += *c;
*b += *c;
}
Output:
f:
ldr r3, [r2, #0]
ldr ip, [r1, #0]
ldr r2, [r0, #0]
add r2, r2, r3
add r3, ip, r3
str r2, [r0, #0]
str r3, [r1, #0]
The second dereferencing of c isn't in the second program, shortening it by one instruction.
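If you want to keep the global pointers from the question, another way to get a single load is to copy *c into a local yourself. This is only a sketch: f_single_load is a hypothetical name, and it is not equivalent to the original f() when c aliases a, because it uses the value of *c from before the first store.

```c
#include <stdint.h>

uint32_t *a;
uint32_t *b;
uint32_t *c;

/* Reads *c exactly once. Behaves like the question's f() only when
   c does not point at the same object as a. */
void f_single_load(void)
{
    uint32_t tmp = *c;   /* single load of *c */
    *a += tmp;
    *b += tmp;
}
```

With the value held in a local, the compiler no longer has to assume that the store through a could have changed it.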
Upvotes: 6
Reputation: 71526
The add destroys r0, so we lose the value of *c and have to reload it:
ldr r2, .L2+4 ; r2 = address of the variable c, read from the literal pool in .text
...
ldr r2, [r2] ; r2 = c (the pointer value)
...
ldr r0, [r2] ; r0 = *c
...
add r0, ip, r0 ; this overwrites r0, so it no longer holds *c
...
ldr r2, [r2] ; *c is needed again to add to *b
Interestingly, different versions of gcc and/or different optimization levels choose a different mix of registers, but produce the same sequence with the additional load. The main question here is why it did this:
add r0, ip, r0
str r0, [r3]
instead of
add ip, ip, r0
str ip, [r3]
and then not need to re-load c?
A nuance of the peephole optimizer is my guess. A related question is why it starts working on **b before it has finished storing a. Had it not done that, it would have had yet another free register (no doubt another optimization).
Another interesting point is at least one of my gcc compilers produces this:
00001000 <_start>:
1000: eaffffff b 1004 <fun>
00001004 <fun>:
1004: e59f2034 ldr r2, [pc, #52] ; 1040 <fun+0x3c>
1008: e59f3034 ldr r3, [pc, #52] ; 1044 <fun+0x40>
100c: e5921000 ldr r1, [r2]
1010: e5932000 ldr r2, [r3]
1014: e591c000 ldr ip, [r1]
1018: e5920000 ldr r0, [r2]
101c: e59f3024 ldr r3, [pc, #36] ; 1048 <fun+0x44>
1020: e08c0000 add r0, ip, r0
1024: e5933000 ldr r3, [r3]
1028: e5810000 str r0, [r1]
102c: e5922000 ldr r2, [r2]
1030: e5931000 ldr r1, [r3]
1034: e0812002 add r2, r1, r2
1038: e5832000 str r2, [r3]
103c: e12fff1e bx lr
1040: 00009054 andeq r9, r0, r4, asr r0
1044: 00009050 andeq r9, r0, r0, asr r0
1048: 0000904c andeq r9, r0, ip, asr #32
Disassembly of section .bss:
0000904c <__bss_start>:
904c: 00000000 andeq r0, r0, r0
00009050 <c>:
9050: 00000000 andeq r0, r0, r0
00009054 <a>:
9054: 00000000 andeq r0, r0, r0
With or without naked you get the same thing. Why was gcc so keen to use every available register rather than, for example, the stack? Note that in your compile it adds a and then stores it; in mine it adds a, then loads *b, then stores a. Not only did it move the load of **b up in the sequence, it also loaded *b before finishing the result of a.
So the naked attribute didn't help here, other than removing the bx lr at the end of the function. What you can/should try is -fdump-rtl-all on the gcc command line (it makes a LOT of files); walk through those to see where gcc started and where it changed things. That may explain the output. If the change does not happen in the compiler proper, then the peephole optimizer in the backend rearranged things, and I am not sure what the command-line option is to dump that.
Bottom line: over the long haul (tens of thousands, hundreds of thousands, millions of lines of code) the compiler/optimizer is going to outperform the human, but it is very easy to find isolated portions of optimized code that can be hand-tuned to be a little bit "better", depending on your definition of better. Note that fewer instructions is not always better.
Upvotes: 1
Reputation: 28087
Consecutive execution of ldr rX, [rX] would mean double dereferencing of whatever rX is pointing to.
If I got your question right, first one as you say is:
ldr r2, [r2] ; r2 <- &(*c)
then second one becomes
ldr r2, [r2] ; r2 <- *(r2)
If that's not the question then from GCC docs (see bold part):
naked
This attribute is available on the ARM, AVR, MCORE, MSP430, NDS32, RL78, RX and SPU ports. It allows the compiler to construct the requisite function declaration, while allowing the body of the function to be assembly code. The specified function will not have prologue/epilogue sequences generated by the compiler. Only Basic asm statements can safely be included in naked functions (see Basic Asm). While using Extended asm or a mixture of Basic asm and “C” code may appear to work, they cannot be depended upon to work reliably and are not supported.
Upvotes: 0