Reputation: 3
I'm finding it surprisingly difficult to find good, complete examples of assembly running on Apple Silicon, specifically for SIMD-type operations, rather than incomplete, overly-generic snippets.
For my own curiosity, I want to write an example on an M2 machine that...
I have the following source code, in a file named test.s
...
.global _start
.align 2
_start:
;;; Load numbers into x0
ldr x0, numbers
;;; Load elements from array in x0 into dedicated Neon register
ld1 { v0.4s }, [x0]
;;; Accumulate elements in vector using dedicated Neon instruction
addv s0, v0.4s
;;; Prepare formatted string
adrp x0, format@page
add x0, x0, format@pageoff
;;; Add result to the stack for printing
str s0, [sp, #-16]!
;;; Print string
bl _printf
mov x16, #1
svc 0
numbers: .word 1, 2, 3, 4
format: .asciz "Answer: %u.\n"
..., assembled and linked using the following commands...
as -g -arch arm64 -o test.o test.s
ld -o test test.o -lSystem -syslibroot `xcrun -sdk macosx --show-sdk-path` -e _start -arch arm64
I'd have expected the answer to be 10 when I run the programme, but I get anything but.
What is it I'm not doing correctly?
Upvotes: 0
Views: 86
Reputation: 3
Thanks, both - had some interesting results when running through clang, with Peter's suggestions.
;;; 16-byte align "numbers"
.p2align 4, 0x0 ; previously 2
numbers: .long 1, 2, 3, 4 ; previously .word
.global _start
.p2align 2 ; reset alignment
_start:
;;; Back up x29 and x30, and move stack pointer
sub sp, sp, #32
stp x29, x30, [sp, #16]
add x29, sp, #16
;;; Load numbers, as Nate has suggested
adrp x8, numbers@page
;;; Slightly different `ldr` approach, using q0
ldr q0, [x8, numbers@pageoff]
;;; Accumulate vector
addv.4s s0, v0
;;; Move 32-bit result to 32-bit GP register
fmov w8, s0
;;; Store 64-bit register counterpart onto the stack for printing
str x8, [sp]
;;; Prime string for printing
adrp x0, format@page
add x0, x0, format@pageoff
;;; Print string
bl _printf
;;; Prepare "return 0" from "int main()"
mov w0, #0
;;; Restore x29, x30, and original stack pointer
ldp x29, x30, [sp, #16]
add sp, sp, #32
;;; "return 0"
ret
format:
.asciz "Answer: %u.\n"
I received the following alignment error from the linker...
ld: 'numbers' from 'assembly.o' at 0x100003F6C not 16-byte aligned, which cannot be encoded as a target of LDR/STR in '_start'+12 from 'assembly.o'
final section layout:
__PAGEZERO addr=0x00000000, size=0x100000000, fileOffset=0x00000000
__TEXT addr=0x100000000, size=0x00004000, fileOffset=0x00000000
__text addr=0x100003f30, size=0x0000005d, fileOffset=0x00003f30
__stubs addr=0x100003f90, size=0x0000000c, fileOffset=0x00003f90
__unwind_info addr=0x100003f9c, size=0x00000060, fileOffset=0x00003f9c
__DATA_CONST addr=0x100004000, size=0x00004000, fileOffset=0x00004000
__got addr=0x100004000, size=0x00000008, fileOffset=0x00004000
__LINKEDIT addr=0x100008000, size=0x00004000, fileOffset=0x00008000
..., and needed the .p2align 4, 0x0 for numbers, in order to make it work.
Interesting to see the use of ldr q0, ... instead of ld1 { v0.4s }, ... and addv.4s s0, v0 instead of addv s0, v0.4s, from the compiler.
Will need to do some more research into alignment, experimenting with other instructions, and the choice of x8 over, say, x2 or x3 (avoiding argument registers, maybe?).
Thanks again for your help.
Upvotes: 0
Reputation: 58673
ldr x0, numbers is going to load from the address labeled numbers into x0 (which only works because numbers happens to be at a sufficiently nearby address to the instruction, in the same section). So the value in x0 will not be the address of numbers, but rather the data stored there. You'll end up with x0 containing the value 0x0000000200000001 and the subsequent memory access will likely crash.
You should put the address of numbers into x0 with an adrp/add sequence just like you do with format further down.
Also, st1 should be ld1, as you already mentioned.
Changing these lines to
adrp x0, numbers@page
add x0, x0, numbers@pageoff
ld1 { v0.4s }, [x0]
makes the program print the correct value 10 for me.
Upvotes: 2