muXXmit2X

Reputation: 2765

looping over an array in NASM

I want to learn programming in assembly to write fast and efficient code. However, I've stumbled over a problem I can't solve.

I want to loop over an array of double words and add its components like below:

%include "asm_io.inc"  
%macro prologue 0
    push    rbp
    mov     rbp,rsp
    push    rbx
    push    r12
    push    r13
    push    r14
    push    r15
%endmacro
%macro epilogue 0
    pop     r15
    pop     r14
    pop     r13
    pop     r12
    pop     rbx
    leave
    ret
%endmacro

segment .data
string1 db  "result: ",0
array   dd  1, 2, 3, 4, 5

segment .bss


segment .text
global  sum

sum:
    prologue

    mov  rdi, string1
    call print_string

    mov  rbx, array
    mov  rdx, 0
    mov  ecx, 5

lp:
    mov  rax, [rbx]
    add  rdx, rax
    add  rbx, 4
    loop lp

    mov  rdi, rdx
    call print_int
    call print_nl

    epilogue

sum is called by a simple C driver. The functions print_string, print_int and print_nl look like this:

section .rodata
int_format  db  "%i",0
string_format db "%s",0

section .text
global  print_string, print_nl, print_int, read_int
extern printf, scanf, putchar

print_string:
    prologue
    ; string address has to be passed in rdi
    mov     rsi,rdi
    mov     rdi,dword string_format
    xor     rax,rax
    call    printf
    epilogue

print_nl:
    prologue
    mov     rdi,0xA
    xor     rax,rax
    call    putchar
    epilogue

print_int:
    prologue
    ;integer arg is in rdi
    mov     rsi, rdi
    mov     rdi, dword int_format
    xor     rax,rax
    call    printf
    epilogue

When printing the result after summing all array elements, it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array. Can somebody tell me why the loop skips the first element?

Edit

I forgot to mention that I'm using an x86_64 Linux system.

Upvotes: 3

Views: 2872

Answers (1)

Peter Cordes

Reputation: 363980

I'm not sure why your code is printing the wrong number. Probably an off-by-one somewhere that you should track down with a debugger. gdb with layout asm and layout reg should help. Actually, I think you're going one past the end of the array. There's probably a -1 there, and you're adding it to your accumulator.

If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info. Esp. Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. e.g. leave is shorter, but takes 3 uops, compared to mov rsp, rbp / pop rbp taking 2. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, esp. in functions that are worth writing in ASM (i.e. usually everything lives in registers, and you don't call other functions).


The "normal" way to do this would be to write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like

#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);

Then even if you compile for Windows, you don't have to change your ASM to look for its args in different registers. (The SysV calling convention is nicer than the Win64: more args in registers, and all the vector registers are allowed to be used without saving them.) Make sure you have a new enough gcc, that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275, though.

An alternative is to use some assembler macros to %define some register names so you can assemble the same source for Windows or SysV ABIs. Or have a Windows entry-point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that obviously is less efficient.


It's useful to know what function calls look like in asm, but writing them yourself is a waste of time, usually. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient. (push/pop every callee-saved register, even though you use none of them, and multiple calls to printf instead of using a single format string ending with a \n.) I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P

My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.


So, on to your loop:

lp:
    mov  rax, [rbx]
    add  rdx, rax
    add  rbx, 4
    loop lp

Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.

Since you're using a single-register addressing mode, using add rdx, [rbx] would also be more efficient than a separate mov-load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop-cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later).

When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The movs insn implicitly uses them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though. If you want more readability, use C with a good compiler.

;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
mov rsi, array
lea rdx, [rsi + array_length*4]  ; if len is really a compile-time constant, get your assembler to generate it for you.
mov eax, [rsi]   ; load first element
mov ebx, [rsi+4] ; load 2nd element
add rsi, 8       ; eliminate this insn by loading array+8 in the first place earlier
; TODO: handle length < 4

ALIGN 16
.loop:
    add eax, [    rsi]
    add ebx, [4 + rsi]
    add rsi, 8
    cmp rsi, rdx
    jb .loop         ;  loop while rsi is Below one-past-the-end
;  TODO: handle odd-length
add eax, ebx
ret

Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.

If your arrays are always going to be very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Previous Intel CPUs can issue 4 uops per cycle, but the execution ports will get behind on keeping up with them. Unrolling to minimize loop overhead would help.)
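The multiple-accumulator idea is easier to see in C (a sketch; `sum_two_accum` is my own name, and like the asm loop above it assumes the length is even):

```c
#include <stddef.h>
#include <stdint.h>

/* Two independent accumulators: the two add chains have no dependency
 * on each other, so the CPU can execute them in parallel instead of
 * serializing every add through a single register. */
int32_t sum_two_accum(const int32_t *arr, size_t len) {
    int32_t acc0 = 0, acc1 = 0;
    for (size_t i = 0; i < len; i += 2) {   /* assumes len is even */
        acc0 += arr[i];
        acc1 += arr[i + 1];
    }
    return acc0 + acc1;   /* combine the accumulators once, at the end */
}
```

With one accumulator, every add waits for the previous one (a latency-bound chain); with two, the loop is limited by throughput instead.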

All these techniques (esp. multiple accumulators) are equally applicable to vector instructions.

segment .rodata         ; read-only data
ALIGN 16
array:  times 64    dd  1, 2, 3, 4, 5
array_bytes equ $-array
string1 db  "result: ",0

segment .text
; TODO: scalar loop until rsi is aligned
; TODO: handle length < 64 bytes
lea rsi, [array + 32]
lea rdx, [rsi - 32 + array_bytes]  ;  array_length could be a register (or 4*a register, if it's a count).
; lea rdx, [array + array_bytes] ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register.  We don't need rdx until later.
movdqu xmm0, [rsi - 32]   ; load first element
movdqu xmm1, [rsi - 16] ; load 2nd element
; note the more-efficient loop setup that doesn't need an add rsi, 32.

ALIGN 16
.loop:
    paddd  xmm0, [     rsi]   ; add packed dwords
    paddd  xmm1, [16 + rsi]
    add rsi, 32
    cmp rsi, rdx
    jb .loop         ;  loop: 4 fused-domain uops
paddd   xmm0, xmm1
phaddd  xmm0, xmm0     ; horizontal add: SSSE3 phaddd is simple but not optimal.  Better to pshufd/paddd
phaddd  xmm0, xmm0
movd    eax, xmm0
;  TODO: scalar cleanup loop
ret

Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration processes two vectors of four packed dword ints = 32 bytes of input data.

It should run at one iteration per cycle on Haswell, otherwise 1 iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.

With AVX2 vpaddd ymm1, ymm1, [32 + rsi], you get double the throughput (if the data is in the cache; otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with a vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0, xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.
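For reference, the pshufd/paddd horizontal-sum pattern (the SSE2-only alternative to phaddd) looks like this with intrinsics; `hsum_epi32` is my own name for the helper:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Horizontal sum of 4 packed dwords using pshufd/paddd (SSE2),
 * avoiding the slower SSSE3 phaddd: two shuffle+add steps,
 * halving the number of live elements each time. */
int hsum_epi32(__m128i v) {
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));     /* swap 64-bit halves */
    __m128i sum64 = _mm_add_epi32(v, hi64);
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); /* swap dword pairs */
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);   /* low dword holds the total */
}
```

Each pshufd+paddd pair runs on different ports than phaddd's multiple shuffle uops, so this is both shorter in uops and more portable (SSE2 is baseline on x86-64).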

Upvotes: 3
