Reputation: 3543
I've just started learning ARM assembly and I don't understand why the GNU as syntax is not the same as for x86*.
As the directives are the same, I would have expected everything to be like x86* except the instructions themselves, but instead I'm struggling to load the address of a string, etc. I'm starting from scratch by reading some PDFs online and man 2 syscall, and by decompiling basic examples, because I'm not sure of the value of the various Hello World programs I can find online.
My issues:
The % sigil on registers appears to be optional.
Constants take either a # or a $ sigil. In fact, if I compile mov r0, $1, objdump -D gives me back a mov r0, #1.
Everything assembles down to the same mov r0, #1:
mov %r0, $1
10080: e3a00001 mov r0, #1
mov r0, $1
10084: e3a00001 mov r0, #1
mov %r0, #1
10088: e3a00001 mov r0, #1
mov r0, #1
1008c: e3a00001 mov r0, #1
I'm unable to use the address of a label directly to load a string address, so I need to use a variable for that. mov r1, $hello or ldr r1, $hello do not work. In x86_64, I would have written mov $hello, %rsi. So I'm doing what gcc does: I'm creating a word with the address of that other label.
I'm unable to put my constants in .rodata, or I get an Error: internal_relocation (type: OFFSET_IMM) not fixed up, but putting everything in .text works (this part is not related to syntax).
.section .text
hello:
.asciz "Hello World\n"
.set hello_len, .-hello
hello_addr:
.word hello
.align 4
.global _start
_start:
mov r0, $1
ldr r1, hello_addr
mov r2, $hello_len
mov r7, $4
swi $0
mov r0, $0
mov r7, $1
swi $0
Upvotes: 14
Views: 5911
Reputation: 39591
The reason why the GNU Assembler (GAS) uses AT&T syntax for x86 assembly is for compatibility with AT&T's x86 assemblers. Instead of using a syntax based on Intel's official x86 assembly syntax, AT&T chose to create a new syntax based on their earlier 68000 and PDP-11 assemblers. When x86 support was added to the GNU compiler (GCC) it generated AT&T syntax assembly because that was the assembler they were using. When GAS was created sometime after this, the GNU assembler had to use that syntax.
However, there was no version of the AT&T assembler for ARM CPUs. When the GNU project started porting GCC and GAS to ARM targets there was no reason to create a new and incompatible syntax for ARM assembly. Instead they based the syntax on ARM's official syntax. This means you can look up ARM instructions in ARM's official documentation and use the syntax and operand order you see there with the GNU assembler. When writing x86 assembly in AT&T syntax you just have to know the rules and exceptions, which aren't officially documented anywhere.
The reason why you can't load an address directly into a register in ARM assembly isn't an issue of syntax. ARM CPUs simply don't have an instruction that can do that. All ARM instructions are the same size, 32 bits, leaving no room to encode a 32-bit address as an immediate operand. However, ARM assemblers do provide a pseudo-instruction form of LDR that can handle loading 32-bit addresses and constants automatically: ldr r1, =hello. This will cause the assembler to store the 32-bit constant in a literal table and use a PC-relative LDR instruction to load it into the register. If the constant being loaded happens to be small enough to load directly using MOV or MVN, that instruction is generated instead.
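As a rough illustration of the "small enough for MOV or MVN" rule, here is a Python sketch (not something the assembler exposes; the function names are made up for this example). It assumes the classic ARM data-processing immediate, an 8-bit value rotated right by an even amount, and ignores other forms such as MOVW that newer assemblers may also pick on newer cores.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def arm_immediate(value):
    """Return (imm8, rot) if value == imm8 ROR (2 * rot), else None."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        # Undo the rotation: rotating left by 2*rot is rotating right by 32 - 2*rot.
        imm8 = ror32(value, 32 - 2 * rot)
        if imm8 <= 0xFF:
            return imm8, rot
    return None


def load_strategy(value):
    """Guess how `ldr rX, =value` could be satisfied (classic ARM mode only)."""
    if arm_immediate(value) is not None:
        return "mov"
    if arm_immediate(~value & 0xFFFFFFFF) is not None:
        return "mvn"
    return "literal pool"


def encode_mov(rd, value):
    """Encode `mov rd, #value` with condition AL; value must be encodable."""
    imm8, rot = arm_immediate(value)
    return 0xE3A00000 | (rd << 12) | (rot << 8) | imm8


if __name__ == "__main__":
    for v in (1, 0x00110000, 0xFFFFFFFF, 0x12345678):
        print(hex(v), load_strategy(v))
    print(hex(encode_mov(0, 1)))
```

With this rule, 1 and 0x00110000 fit a MOV, 0xFFFFFFFF fits an MVN (its complement is 0), and 0x12345678 has to come from the literal table.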
The reason why you can't put the constant in .rodata is either because it's too far away to address using a PC-relative LDR instruction (it needs to be within +/-4KB because that is the biggest displacement that can fit into a single 32-bit ARM instruction) or because the object format you're using doesn't support PC-relative addressing to a different section. (Your ldr r1, hello_addr instruction uses PC-relative addressing as there's no way to encode a 32-bit address in an ARM instruction.)
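To make the distance limit concrete, here is a small Python sketch of the offset calculation. It assumes ARM (not Thumb) state, where the PC reads as the instruction's address plus 8, and a 12-bit displacement with a separate add/subtract bit. The function name and the addresses are invented for illustration.

```python
def ldr_literal_offset(insn_addr, literal_addr):
    """Return the displacement for `ldr rX, [pc, #offset]`, or raise if out of reach."""
    pc = insn_addr + 8  # in ARM state the PC reads two instructions ahead
    offset = literal_addr - pc
    if not -4095 <= offset <= 4095:  # 12-bit magnitude plus an up/down bit
        raise ValueError(f"literal is {offset} bytes away, outside +/-4095")
    return offset


if __name__ == "__main__":
    print(ldr_literal_offset(0x100, 0x110))   # nearby word in the same section
    try:
        ldr_literal_offset(0x100, 0x2000)     # hypothetical far-away .rodata word
    except ValueError as err:
        print("cannot encode:", err)
```

This only covers the distance half of the explanation. Whether a cross-section reference can be resolved at all depends on the object format and relocation support, which the sketch does not model.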
Upvotes: 18
Reputation: 71536
Assembly language is defined by the assembler, the program that parses it. It is in the best interest of the processor vendor (IP or chip) to create an assembler or have one created. It is also in their best interest to document the machine language, and so they describe the machine language using the assembly language they have created or contracted, so that these items all work together. Assembly language is in no way a universal thing that works across all platforms, and there is no reason to assume that different assemblers for the same target would use the same assembly language, the most famous case being the sad result of AT&T syntax for Intel x86. Intel could have done better, sure, but it was CISC and made sense at the time (the mov instruction is heavily overloaded, but the assembly language could still have been a little cleaner; remember we are decades down the road now, with a lot more experience).
As far as I can tell, GNU has always ruined the existing assembly language for a target when that target is added, creating a new assembly language for it. Perhaps this is intentional incompatibility; the result is close at times, but still different enough to be incompatible. Likewise, some directives work across the GNU assembler's assembly languages, but there are differences. In reality it is not "GNU" but the individual or team who chose to create the port to that target, and they do whatever they feel like, which is the nature of assembly language.
If you learned x86 before ARM I truly feel for you; I really hope x86 was not your first assembly language. The percent sign on registers is not historically an x86 thing, and it is a bit sad that someone felt they needed to add it when many assemblers written by that time had demonstrated there was no need for it. ARM's assembly language, be it GNU's or one of the many flavors of ARM assembler, is one of the cleanest assembly languages out there: it makes the most sense and is the least vague.
What matters is the machine code. The machine code is the standard you must conform to for that target, not the assembly language. As long as you can produce the machine code, the assembly languages can and do vary; that is the nature of assembly language. As with AT&T and the folks who have done the individual GNU target ports, you are certainly welcome to write your own assembler and assembly language. If you use a common file format for your object output (ELF in the case of ARM), you can write your assembly language using your assembler and then link it with C or other code using the GNU tools. Nobody is stopping you from doing this, and it is a very good way to learn an instruction set. I prefer to write a disassembler or an instruction set simulator, but writing an assembler (roughly a weekend task, maybe a few weeknights more for fine tuning) would also do quite well.
One could just as easily complain that x86 GNU assembly language doesn't look like ARM, or MIPS, or fill in the blank. That is not really relevant; there are very obvious reasons why. The ARM port is at least semi-portable with the documentation and tools that existed prior to the GNU port, which in and of itself is why the GNU assembler is used at all... Someone would have made an alternate port if the ARM backend had been fashioned after some other processor's commonly found syntax. Also note that there is some disturbing mangling of ARM's assembly happening in the GNU world; perhaps you should jump on that bandwagon...
To answer your actual questions, since you do have actual questions: x86 and ARM are completely different instruction sets, CISC vs RISC. You can't have a fixed-size instruction and fit any size of immediate you want in there. The immediates have rules (please read the ARM documentation for the instructions you are trying to use); otherwise you have to do a PC-relative load, and the distance a PC-relative load can reach is limited as well, as you perhaps understand from some x86 instructions that have limited reach. So far the various assemblers have given us a pseudo-instruction solution:
ldr r0,=0x00110000
ldr r0,=0x12345678
ldr r0,=mylabel
ldr r0,mylabeladd
ldr r0,myvalue
b .
mylabeladd: .word mylabel
mylabel: .word 1,2,3,4
myvalue: .word 0x11223344
giving
00000000 <mylabeladd-0x18>:
0: e3a00811 mov r0, #1114112 ; 0x110000
4: e59f0024 ldr r0, [pc, #36] ; 30 <myvalue+0x4>
8: e59f0024 ldr r0, [pc, #36] ; 34 <myvalue+0x8>
c: e59f0004 ldr r0, [pc, #4] ; 18 <mylabeladd>
10: e59f0014 ldr r0, [pc, #20] ; 2c <myvalue>
14: eafffffe b 14 <mylabeladd-0x4>
00000018 <mylabeladd>:
18: 0000001c andeq r0, r0, r12, lsl r0
0000001c <mylabel>:
1c: 00000001 andeq r0, r0, r1
20: 00000002 andeq r0, r0, r2
24: 00000003 andeq r0, r0, r3
28: 00000004 andeq r0, r0, r4
0000002c <myvalue>:
2c: 11223344 ; <UNDEFINED> instruction: 0x11223344
30: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
34: 0000001c andeq r0, r0, r12, lsl r0
If the value can't fit, or if it is a label, the assembler creates the literal for you (in .text, since it cannot assume it can reach any other section). If it can fit, the assembler creates a mov for you (at least GAS does).
Or you can craft the PC-relative load yourself, as with mylabeladd.
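If you want to check the PC-relative loads in the listing above by hand, here is a small Python sketch that decodes them. It is not part of any toolchain, decode_ldr_literal is a made-up helper, and it handles only the ARM-state `ldr rT, [pc, #imm]` word form.

```python
def decode_ldr_literal(word, insn_addr):
    """Decode an ARM `ldr rT, [pc, #+/-imm12]` word into (rT, address loaded from)."""
    # Fixed bits: 01 0 1 x 0 0 1 with Rn == 1111 (pc); bit 23 (U) is the add/subtract flag.
    if word & 0x0F7F0000 != 0x051F0000:
        raise ValueError(f"{word:#010x} is not a pc-relative word load")
    rt = (word >> 12) & 0xF
    imm12 = word & 0xFFF
    offset = imm12 if word & (1 << 23) else -imm12
    return rt, insn_addr + 8 + offset  # pc reads as instruction address + 8


if __name__ == "__main__":
    # (address, machine word) pairs copied from the disassembly above
    listing = [(0x4, 0xE59F0024), (0x8, 0xE59F0024),
               (0xC, 0xE59F0004), (0x10, 0xE59F0014)]
    for addr, word in listing:
        rt, target = decode_ldr_literal(word, addr)
        print(f"{addr:#x}: ldr r{rt}, loads the word at {target:#x}")
```

The first two loads resolve to 0x30 and 0x34, the words the assembler appended after myvalue for the `=` forms. The last two resolve to 0x18 (mylabeladd) and 0x2c (myvalue), the hand-written ones.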
If you want to reach any other section then you have to do it properly:
.globl _start
_start:
mov r3,#1
ldr r0,=mydata
str r3,[r0]
ldr r1,mydataadd
str r3,[r1]
b .
mydataadd: .word mydata
.data
mydata: .word 0
giving when linked
00001000 <_start>:
1000: e3a03001 mov r3, #1
1004: e59f0010 ldr r0, [pc, #16] ; 101c <mydataadd+0x4>
1008: e5803000 str r3, [r0]
100c: e59f1004 ldr r1, [pc, #4] ; 1018 <mydataadd>
1010: e5813000 str r3, [r1]
1014: eafffffe b 1014 <_start+0x14>
00001018 <mydataadd>:
1018: 80000000 andhi r0, r0, r0
101c: 80000000 andhi r0, r0, r0
Disassembly of section .data:
80000000 <__data_start>:
80000000: 00000000 andeq r0, r0, r0
You have to do the same thing for external labels, but for branches and the like that stay within the same .text section, the linker will try to help you out.
.globl _start
_start:
b fun
in another file
.globl fun
fun:
b .
and no surprise...
00000000 <_start>:
0: eaffffff b 4 <fun>
00000004 <fun>:
4: eafffffe b 4 <fun>
but what if
.thumb
.thumb_func
.globl fun
fun:
b .
thank you gnu!
00000000 <_start>:
0: ea000000 b 8 <__fun_from_arm>
00000004 <fun>:
4: e7fe b.n 4 <fun>
...
00000008 <__fun_from_arm>:
8: e59fc000 ldr r12, [pc] ; 10 <__fun_from_arm+0x8>
c: e12fff1c bx r12
10: 00000005 andeq r0, r0, r5
14: 00000000 andeq r0, r0, r0
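The word at offset 0x10 in that trampoline is 0x00000005 even though fun sits at address 4. For bx, bit 0 of the target selects the instruction set (1 for Thumb) and is not part of the address. A tiny Python sketch of that split, with a made-up helper name:

```python
def split_interworking_address(word):
    """Split a bx/blx target into (real address, instruction set)."""
    mode = "thumb" if word & 1 else "arm"
    return word & 0xFFFFFFFE, mode


if __name__ == "__main__":
    address, mode = split_interworking_address(0x00000005)
    print(f"{address:#x} {mode}")  # the word stored in __fun_from_arm above
```

So the trampoline loads 5 into r12 and bx r12 lands on fun at address 4 in Thumb state.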
or simulate a really big program
.globl _start
_start:
b fun
.space 0x10000000
sigh:
arm-none-eabi-ld -Ttext=0 so.o x.o -o so.elf
so.o: In function `_start':
(.text+0x0): relocation truncated to fit: R_ARM_JUMP24 against symbol `fun' defined in .text section in x.o
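The "truncated to fit" message follows from the branch encoding: an ARM b holds a signed 24-bit word offset, so it reaches roughly +/-32MB. Below is a Python sketch of that encoding (condition AL only; encode_arm_branch is a made-up name). The far target address is an assumption based on the .space size, not something taken from linker output.

```python
def encode_arm_branch(insn_addr, target):
    """Encode an unconditional ARM `b target` located at insn_addr."""
    offset = target - (insn_addr + 8)  # pc reads as instruction address + 8
    if offset % 4:
        raise ValueError("ARM branch targets must be word aligned")
    words = offset // 4
    if not -(1 << 23) <= words < (1 << 23):  # signed 24-bit field
        raise OverflowError(f"offset {offset:#x} does not fit in 24 bits of words")
    return 0xEA000000 | (words & 0xFFFFFF)


if __name__ == "__main__":
    print(hex(encode_arm_branch(0x0, 0x4)))   # `b fun` with fun right behind it
    print(hex(encode_arm_branch(0x4, 0x4)))   # `b .`
    try:
        encode_arm_branch(0x0, 0x10000004)    # assumed address of fun after .space 0x10000000
    except OverflowError as err:
        print("relocation would be truncated:", err)
```

The first two results match the eaffffff and eafffffe words in the earlier two-file listing. The third case is the one the linker refuses.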
Well then just like reaching across sections
.globl _start
_start:
ldr r0,=fun
bx r0
.ltorg
.space 0x10000000
and that works...
00000000 <_start>:
0: e51f0000 ldr r0, [pc, #-0] ; 8 <_start+0x8>
4: e12fff10 bx r0
8: 1000000d andne r0, r0, sp
...
1000000c <fun>:
1000000c: e7fe b.n 1000000c <fun>
but you have to make sure the linker is actually helping you out, as it might not, and the trampoline from ARM to Thumb wasn't always there either...
.globl _start
_start:
b fun
.globl more_fun
more_fun:
b .
other file
.thumb
.thumb_func
.globl fun
fun:
b more_fun
produces perfectly broken code.
00000000 <_start>:
0: ea000002 b 10 <__fun_from_arm>
00000004 <more_fun>:
4: eafffffe b 4 <more_fun>
00000008 <fun>:
8: e7fc b.n 4 <more_fun>
a: 0000 movs r0, r0
c: 0000 movs r0, r0
...
00000010 <__fun_from_arm>:
10: e59fc000 ldr r12, [pc] ; 18 <__fun_from_arm+0x8>
14: e12fff1c bx r12
18: 00000009 andeq r0, r0, r9
1c: 00000000 andeq r0, r0, r0
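My reading of why this is broken (an interpretation, not something the tools report): the Thumb b.n at address 8 is a plain branch, which does not switch instruction sets, yet it lands on more_fun, which is ARM code. The Python sketch below decodes the 16-bit Thumb unconditional branch to confirm where it goes. decode_thumb_b is a made-up helper.

```python
def decode_thumb_b(halfword, insn_addr):
    """Return the target of a 16-bit Thumb unconditional branch (b.n)."""
    if halfword >> 11 != 0b11100:
        raise ValueError(f"{halfword:#06x} is not a 16-bit Thumb b")
    imm11 = halfword & 0x7FF
    if imm11 & 0x400:          # sign-extend the 11-bit field
        imm11 -= 0x800
    # In Thumb state the pc reads as instruction address + 4; offset is in halfwords.
    return insn_addr + 4 + (imm11 << 1)


if __name__ == "__main__":
    print(hex(decode_thumb_b(0xE7FC, 0x8)))  # the b.n in fun above
```

The target is 4, the ARM-state more_fun, reached while the core is still in Thumb state. No veneer was inserted in that direction.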
Now had I used more gnu specific syntax that might have worked...
.globl _start
_start:
b fun
void more_fun ( void )
{
return;
}
nope, guess not
00000000 <_start>:
0: ea000002 b 10 <__fun_from_arm>
00000004 <more_fun>:
4: e12fff1e bx lr
00000008 <fun>:
8: e7fc b.n 4 <more_fun>
a: 0000 movs r0, r0
c: 0000 movs r0, r0
...
00000010 <__fun_from_arm>:
10: e59fc000 ldr r12, [pc] ; 18 <__fun_from_arm+0x8>
14: e12fff1c bx r12
18: 00000009 andeq r0, r0, r9
1c: 00000000 andeq r0, r0, r0
All part of the fun though... Clearly you are dealing with different instruction sets: x86, ARM, MIPS, AVR, MSP430, PDP-11, Xtensa, RISC-V, and other GNU-supported targets. Once you learn one assembly language, or two or three, the rest are more similar than different. The syntax is the syntax and is easy to move beyond; the real issues are what you can or cannot do with that instruction set. And the answers often lie in the documentation from the vendor (not just some instruction set reference you googled).
Upvotes: 11