Benoît

Reputation: 3543

Why is GNU as syntax different between x86 and ARM?

I've just started learning ARM assembly and I don't understand why the GNU as syntax is not the same as for x86*.

As the directives are the same, I would have expected everything to be like x86* except the instructions themselves, but instead I'm struggling to load the address of a string, etc. I'm starting from scratch by reading some PDFs online and man 2 syscall, and by disassembling basic examples, because I'm not sure of the value of the various Hello World examples I can find online.

My issues:

Everything assembles down to the same mov r0, #1:

        mov %r0, $1
   10080:       e3a00001        mov     r0, #1
        mov r0, $1
   10084:       e3a00001        mov     r0, #1
        mov %r0, #1
   10088:       e3a00001        mov     r0, #1
        mov r0, #1
   1008c:       e3a00001        mov     r0, #1

.section .text
hello:
        .asciz "Hello World\n"
        .set hello_len, .-hello

hello_addr:
        .word hello

.align 4
.global _start
_start:
        mov r0, $1
        ldr r1, hello_addr
        mov r2, $hello_len
        mov r7, $4
        swi $0

        mov r0, $0
        mov r7, $1
        swi $0

Upvotes: 14

Views: 5911

Answers (2)

Ross Ridge

Reputation: 39591

The reason why the GNU Assembler (GAS) uses AT&T syntax for x86 assembly is for compatibility with AT&T's x86 assemblers. Instead of using a syntax based on Intel's official x86 assembly syntax, AT&T chose to create a new syntax based on their earlier 68000 and PDP-11 assemblers. When x86 support was added to the GNU compiler (GCC) it generated AT&T syntax assembly because that was the syntax of the assembler they were using. When GAS was created some time after this, it had to use that syntax.

However, there was no version of the AT&T assembler for ARM CPUs. When the GNU project started porting GCC and GAS to ARM targets there was no reason to create their own new and incompatible syntax for ARM assembly. Instead they based the syntax on ARM's official syntax. This means you can look up ARM instructions in ARM's official documentation and use the syntax and operand order you see there with the GNU assembler. When writing x86 assembly in AT&T syntax you just have to know the rules and exceptions, which aren't officially documented anywhere.

The reason why you can't load an address directly into a register in ARM assembly isn't an issue of syntax. ARM CPUs simply don't have an instruction that can do that. All ARM instructions are the same size, 32 bits, leaving no room to encode a 32-bit address as an immediate operand. However, ARM assemblers do provide a pseudo-instruction form of LDR that can handle loading 32-bit addresses and constants automatically: ldr r1, =hello. This will cause the assembler to store the 32-bit constant in a literal pool and use a PC-relative LDR instruction to load it from memory into the register. If the constant being loaded happens to be small enough to load directly using MOV or MVN, that instruction is generated instead.
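As a rough illustration of that last choice, here is a small Python sketch (not GAS's actual code; the function names are made up for this example). It assumes the classic 32-bit ARM data-processing immediate, which is an 8-bit value rotated right by an even number of bits, and it ignores Thumb encodings and any other instructions a real assembler might also consider.

```python
def rotl32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF


def is_arm_immediate(value):
    """True if `value` is an 8-bit constant rotated right by an even amount."""
    value &= 0xFFFFFFFF
    # Undo each possible even rotation and see whether the result fits in 8 bits.
    return any(rotl32(value, rot) <= 0xFF for rot in range(0, 32, 2))


def pick_load(value):
    """Hypothetical helper: which form could 'ldr rX, =value' turn into?"""
    value &= 0xFFFFFFFF
    if is_arm_immediate(value):
        return "mov"
    if is_arm_immediate(~value):   # MVN writes the bitwise NOT of its operand
        return "mvn"
    return "ldr from literal pool"
```

Under these assumptions pick_load(0x00110000) returns "mov", while pick_load(0x12345678) falls through to the literal pool.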

The reason why you can't put the constant in .rodata is either because it's too far away to address using a PC-relative LDR instruction (it needs to be within +/-4KB because that's the biggest displacement that can fit into a single 32-bit ARM instruction) or because the object format you're using doesn't support PC-relative addressing to a different section. (Your ldr r1, hello_addr instruction uses PC-relative addressing as there's no way to encode a 32-bit address in an ARM instruction.)

Upvotes: 18

old_timer

Reputation: 71536

Assembly language is defined by the assembler, the program that parses it. It is in the best interest of the processor vendor (IP or chip) to create an assembler or have one created. It is also in their best interest to document the machine language, and as such they match the machine language with the assembly language that they have created or contracted, so that these items all work together. Assembly language is in no way some universal thing that works across all platforms. There is no reason to assume that different assemblers for the same target would use the same assembly language, the most famous example being the sad results of AT&T with Intel x86. Intel could have done better, sure, but it was CISC and it made sense at the time (the mov instruction being so overloaded; still, the assembly language could have been a little cleaner, but remember we are decades down the road now with a lot more experience).

GNU has, so far as I can tell, always ruined the existing assembly language for a target when that target is added, creating a new assembly language for it. Perhaps to be intentionally incompatible: close at times, but still different enough to be incompatible. Likewise there are some directives that work across GNU assembler assembly languages, but then there are differences. The reality is that it is not "GNU" but the individual or team that chooses to create the port to that target, and they do whatever they feel like, which is the nature of assembly language.

If you learned x86 before ARM I truly feel for you; I really hope x86 was not your first assembly language. The percent sign on registers is not an x86 thing historically. It is a bit sad really that someone felt they needed to add it when, at the time, many assemblers had been written demonstrating that there was no need for such a thing. ARM's assembly language, be it GNU or one of the many flavors of ARM assembler, is one of the cleanest assembly languages out there; it makes the most sense and is the least vague.

What matters is the machine code. The machine code is the standard you must conform to for that target, not the assembly language. The question is whether you can produce the machine code; the assembly languages can and do vary, that is the nature of assembly language. As with AT&T and the folks that have done the individual GNU target ports, you are certainly welcome to write your own assembler and assembly language. If you use a common file format for your object output (ELF in the case of ARM), you can write your assembly language using your assembler and then link it with C or other code using GNU tools. Nobody is stopping you from doing this, and it is a very good way to learn an instruction set. I prefer to write a disassembler or an instruction set simulator, but writing an assembler (roughly a weekend task, maybe a few weeknights further for fine tuning) would also do quite well.

One could just as easily complain about how x86 GNU assembly language doesn't look like ARM, or MIPS, fill in the blank. Not really relevant, there are very obvious reasons why. Semi-portable with either documentation or tools prior to the GNU port. Which in and of itself is why the GNU assembler is even used at all... Someone would have made an alternate port if the ARM backend had been fashioned after some other processor's commonly found syntax. Also note there is disturbing mangling of ARM's assembly happening in the GNU world, perhaps you should jump on that bandwagon...

To answer your actual questions, since you do have actual questions: x86 and ARM are completely different instruction sets. CISC vs RISC: you can't have a fixed-size instruction and fit any size immediate you want in there. The immediates have rules (please read the ARM documentation for the instructions you are trying to use); otherwise you have to do a PC-relative load, and the distance a PC-relative load can reach is limited as well, as you perhaps understand from some x86 instructions that have limited reach. So far the various assemblers have given us a pseudo-instruction solution:

ldr r0,=0x00110000
ldr r0,=0x12345678
ldr r0,=mylabel
ldr r0,mylabeladd
ldr r0,myvalue
b .

mylabeladd: .word mylabel
mylabel: .word 1,2,3,4
myvalue: .word 0x11223344

giving

00000000 <mylabeladd-0x18>:
   0:   e3a00811    mov r0, #1114112    ; 0x110000
   4:   e59f0024    ldr r0, [pc, #36]   ; 30 <myvalue+0x4>
   8:   e59f0024    ldr r0, [pc, #36]   ; 34 <myvalue+0x8>
   c:   e59f0004    ldr r0, [pc, #4]    ; 18 <mylabeladd>
  10:   e59f0014    ldr r0, [pc, #20]   ; 2c <myvalue>
  14:   eafffffe    b   14 <mylabeladd-0x4>

00000018 <mylabeladd>:
  18:   0000001c    andeq   r0, r0, r12, lsl r0

0000001c <mylabel>:
  1c:   00000001    andeq   r0, r0, r1
  20:   00000002    andeq   r0, r0, r2
  24:   00000003    andeq   r0, r0, r3
  28:   00000004    andeq   r0, r0, r4

0000002c <myvalue>:
  2c:   11223344            ; <UNDEFINED> instruction: 0x11223344
  30:   12345678    eorsne  r5, r4, #120, 12    ; 0x7800000
  34:   0000001c    andeq   r0, r0, r12, lsl r0

If they can't fit it, or if it is a label, they create the value for you (in .text, as you cannot assume that you can reach any other section). If they can fit it, they create a mov for you (at least GAS does).

Or you can craft the PC-relative load yourself, as in mylabeladd.
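To see where the [pc, #36] style offsets in the listing above come from, here is a small Python sketch (the helper name is hypothetical). It assumes ARM state, where reading pc gives the instruction's own address plus 8, and it only handles the immediate-offset LDR form with pc as the base register.

```python
def ldr_literal_target(address, encoding):
    """Decode the address an ARM-state 'ldr rX, [pc, #imm]' reads from.

    Assumes `encoding` is an immediate-offset LDR with pc as the base register.
    """
    imm12 = encoding & 0xFFF        # 12-bit offset field: at most 4095 bytes
    add = (encoding >> 23) & 1      # U bit: 1 = add the offset, 0 = subtract it
    pc = address + 8                # pc reads two instructions ahead in ARM state
    return pc + imm12 if add else pc - imm12
```

For the second line of the listing, ldr_literal_target(0x4, 0xE59F0024) gives 0x30, the address objdump prints in its comment. The 12-bit offset field is also why the data has to sit within roughly 4KB of the load.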

If you want to reach any other section then you have to do it properly:

.globl _start
_start:

mov r3,#1
ldr r0,=mydata
str r3,[r0]
ldr r1,mydataadd
str r3,[r1]
b .
mydataadd: .word mydata
.data
mydata: .word 0

giving when linked

00001000 <_start>:
    1000:   e3a03001    mov r3, #1
    1004:   e59f0010    ldr r0, [pc, #16]   ; 101c <mydataadd+0x4>
    1008:   e5803000    str r3, [r0]
    100c:   e59f1004    ldr r1, [pc, #4]    ; 1018 <mydataadd>
    1010:   e5813000    str r3, [r1]
    1014:   eafffffe    b   1014 <_start+0x14>

00001018 <mydataadd>:
    1018:   80000000    andhi   r0, r0, r0
    101c:   80000000    andhi   r0, r0, r0

Disassembly of section .data:

80000000 <__data_start>:
80000000:   00000000    andeq   r0, r0, r0

You have to do the same thing for external labels, but for branching and such, where the target is in the same .text section, the linker will try to help you out.

.globl _start
_start:

b fun

in another file

.globl fun
fun:
    b .

and no surprise...

00000000 <_start>:
   0:   eaffffff    b   4 <fun>

00000004 <fun>:
   4:   eafffffe    b   4 <fun>

but what if

.thumb
.thumb_func
.globl fun
fun:
    b .

thank you gnu!

00000000 <_start>:
   0:   ea000000    b   8 <__fun_from_arm>

00000004 <fun>:
   4:   e7fe        b.n 4 <fun>
    ...

00000008 <__fun_from_arm>:
   8:   e59fc000    ldr r12, [pc]   ; 10 <__fun_from_arm+0x8>
   c:   e12fff1c    bx  r12
  10:   00000005    andeq   r0, r0, r5
  14:   00000000    andeq   r0, r0, r0
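One detail in the listing above: the word the veneer loads is 0x00000005 although fun sits at address 4. BX uses bit 0 of the register value to select the instruction set (1 means Thumb) and branches to the address with that bit cleared. A minimal Python sketch of that rule, with a made-up helper name:

```python
def bx_destination(value):
    """Split a BX operand into (instruction address, instruction set)."""
    state = "thumb" if value & 1 else "arm"
    return value & ~1, state        # bit 0 is a mode flag, not part of the address
```

So bx_destination(0x5) is (4, "thumb"), which is how the veneer lands on fun in Thumb state.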

or simulate a really big program

.globl _start
_start:

b fun

.space 0x10000000

sigh:

arm-none-eabi-ld -Ttext=0 so.o x.o -o so.elf
so.o: In function `_start':
(.text+0x0): relocation truncated to fit: R_ARM_JUMP24 against symbol `fun' defined in .text section in x.o
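The error follows from how an ARM-state B instruction is encoded: a signed 24-bit count of 4-byte words, relative to pc (the instruction's address plus 8), which gives a reach of about +/-32MB. A Python sketch of that encoding, using a hypothetical helper name:

```python
def encode_arm_branch(address, target):
    """Encode an unconditional ARM-state 'b target' placed at `address`.

    Raises ValueError when the target is out of reach of the 24-bit field.
    """
    offset = target - (address + 8)     # relative to pc, which reads as address + 8
    if offset % 4:
        raise ValueError("ARM branch targets must be word aligned")
    words = offset >> 2                 # the field counts 4-byte words
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("branch out of range for a 24-bit offset")
    return 0xEA000000 | (words & 0xFFFFFF)
```

encode_arm_branch(0x0, 0x4) gives 0xEAFFFFFF, the same word as in the small two-file example earlier, while a target 0x10000000 bytes further on raises the ValueError, which is the situation the linker is complaining about.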

Well then just like reaching across sections

.globl _start
_start:

ldr r0,=fun
bx r0
.ltorg
.space 0x10000000

and that works...

00000000 <_start>:
       0:   e51f0000    ldr r0, [pc, #-0]   ; 8 <_start+0x8>
       4:   e12fff10    bx  r0
       8:   1000000d    andne   r0, r0, sp
    ...

1000000c <fun>:
1000000c:   e7fe        b.n 1000000c <fun>

but you have to make sure the linker is helping you out, as it might not, and the trampoline from ARM to Thumb wasn't always there either...

.globl _start
_start:

    b fun

.globl more_fun
more_fun:
    b .

other file

.thumb
.thumb_func
.globl fun
fun:
    b more_fun

produces perfectly broken code.

00000000 <_start>:
   0:   ea000002    b   10 <__fun_from_arm>

00000004 <more_fun>:
   4:   eafffffe    b   4 <more_fun>

00000008 <fun>:
   8:   e7fc        b.n 4 <more_fun>
   a:   0000        movs    r0, r0
   c:   0000        movs    r0, r0
    ...

00000010 <__fun_from_arm>:
  10:   e59fc000    ldr r12, [pc]   ; 18 <__fun_from_arm+0x8>
  14:   e12fff1c    bx  r12
  18:   00000009    andeq   r0, r0, r9
  1c:   00000000    andeq   r0, r0, r0

Now had I used more GNU-specific syntax that might have worked...

.globl _start
_start:

    b fun

void more_fun ( void )
{
    return;
}

nope, guess not

00000000 <_start>:
   0:   ea000002    b   10 <__fun_from_arm>

00000004 <more_fun>:
   4:   e12fff1e    bx  lr

00000008 <fun>:
   8:   e7fc        b.n 4 <more_fun>
   a:   0000        movs    r0, r0
   c:   0000        movs    r0, r0
    ...

00000010 <__fun_from_arm>:
  10:   e59fc000    ldr r12, [pc]   ; 18 <__fun_from_arm+0x8>
  14:   e12fff1c    bx  r12
  18:   00000009    andeq   r0, r0, r9
  1c:   00000000    andeq   r0, r0, r0

All part of the fun though... Clearly you are dealing with different instruction sets: x86, ARM, MIPS, AVR, MSP430, PDP-11, Xtensa, RISC-V, and other GNU-supported targets. Once you learn one assembly language, or two or three, the rest are more similar than different. The syntax is the syntax, easy to move beyond; the real issue is what you can or cannot do with that instruction set. And the answers often lie in the documentation from that vendor (not just some instruction set reference you googled).

Upvotes: 11
