Reputation: 2765
I want to learn programming in assembly to write fast and efficient code. However, I've stumbled over a problem I can't solve.
I want to loop over an array of double words and add its components like below:
%include "asm_io.inc"
%macro prologue 0
push rbp
mov rbp,rsp
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
string1 db "result: ",0
array dd 1, 2, 3, 4, 5
segment .bss
segment .text
global sum
sum:
prologue
mov rdi, string1
call print_string
mov rbx, array
mov rdx, 0
mov ecx, 5
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
mov rdi, rdx
call print_int
call print_nl
epilogue
sum is called by a simple C driver. The functions print_string, print_int and print_nl look like this:
section .rodata
int_format db "%i",0
string_format db "%s",0
section .text
global print_string, print_nl, print_int, read_int
extern printf, scanf, putchar
print_string:
prologue
; string address has to be passed in rdi
mov rsi,rdi
mov rdi,dword string_format
xor rax,rax
call printf
epilogue
print_nl:
prologue
mov rdi,0xA
xor rax,rax
call putchar
epilogue
print_int:
prologue
;integer arg is in rdi
mov rsi, rdi
mov rdi, dword int_format
xor rax,rax
call printf
epilogue
When printing the result after summing all array elements it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array. Can somebody tell me why the loop skips the first element?
I forgot to mention that I'm using an x86_64 Linux system.
Upvotes: 3
Views: 2872
Reputation: 363980
I'm not sure why your code is printing the wrong number. Probably an off-by-one somewhere that you should track down with a debugger; gdb with layout asm and layout reg should help. Actually, I think you're going one past the end of the array: mov rax, [rbx] loads a qword (two dword elements at once), but you only advance rbx by 4, so the last iteration reads 4 bytes past the end of the array. Whatever happens to be there (maybe a -1) is getting added to your accumulator.
If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info. Esp. Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. e.g. leave is shorter, but takes 3 uops, compared to the 2 for mov rsp, rbp / pop rbp. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, esp. in functions that are worth writing in ASM (i.e. usually everything lives in registers, and you don't call other functions).
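For comparison, the two epilogues look like this (assuming the usual push rbp / mov rbp, rsp prologue):
leave            ; 1 instruction, but 3 uops on Intel SnB-family
ret
; vs.
mov rsp, rbp     ; 2 instructions, 2 uops total
pop rbp
ret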
The "normal" way to do this would be write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like
#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);
Then even if you compile for Windows, you don't have to change your ASM to look for its args in different registers. (The SysV calling convention is nicer than the Win64: more args in registers, and all the vector registers are allowed to be used without saving them.) Make sure you have a new enough gcc, that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275, though.
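For example, a minimal C driver might look like this (the name asm_sum and its signature are just for illustration; the function itself lives in your .asm file):
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SYSV_ABI __attribute__((sysv_abi))

/* implemented in asm: sums n dword elements */
extern int SYSV_ABI asm_sum(const int32_t *arr, size_t n);

int main(void) {
    int32_t array[] = {1, 2, 3, 4, 5};
    printf("result: %d\n", asm_sum(array, sizeof(array) / sizeof(array[0])));
    return 0;
}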
An alternative is to use some assembler macros to %define some register names, so you can assemble the same source for Windows or SysV ABIs. Or have a Windows entry-point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that is obviously less efficient.
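A sketch of the %define approach (ARG0/ARG1 and myfunc are made-up names):
%ifidn __OUTPUT_FORMAT__, win64
%define ARG0 rcx          ; Windows x64 passes the first args in rcx, rdx
%define ARG1 rdx
%else
%define ARG0 rdi          ; SysV passes them in rdi, rsi
%define ARG1 rsi
%endif
myfunc:                   ; hypothetical function using its first two args
mov rax, [ARG0]           ; deref first (pointer) arg
add rax, ARG1             ; add second arg
ret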
It's useful to know what function calls look like in asm, but writing them yourself is a waste of time, usually. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient (push/pop of every callee-saved register, even though you use none of them, and multiple calls to printf instead of using a single format string ending with a \n). I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P
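e.g. with a single format string, the whole "result: N\n" output is one printf call (a sketch; result_fmt is a name I made up, and I'm assuming the sum ended up in eax):
segment .rodata
result_fmt db "result: %d", 10, 0 ; newline built into the format string
segment .text
; ... after the sum loop, with the total in eax:
mov rdi, result_fmt
mov esi, eax              ; integer arg; writing esi zero-extends into rsi
xor eax, eax              ; al = 0: no vector args for the variadic call
call printf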
My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.
So, on to your loop:
lp:
mov rax, [rbx]
add rdx, rax
add rbx, 4
loop lp
Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.
Since you're using a single-register addressing mode, add rdx, [rbx] would also be more efficient than a separate mov-load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop-cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later.)
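Putting those two points together, your loop could be rewritten along these lines (a sketch; it keeps your 5 dword elements and sums into a 32-bit register):
mov rbx, array
xor eax, eax              ; zero the accumulator
mov ecx, 5                ; element count
lp:
add eax, [rbx]            ; dword load folded into the add: 1 fused-domain uop
add rbx, 4
dec ecx
jnz lp                    ; dec/jnz macro-fuse into a single uop
; sum of the dwords is now in eax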
When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The movs insn implicitly uses them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though. If you want more readability, use C with a good compiler.
;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
mov rsi, array
lea rdx, [rsi + array_length*4] ; if len is really a compile-time constant, get your assembler to generate it for you.
mov eax, [rsi] ; load first element
mov ebx, [rsi+4] ; load 2nd element (NB: rbx is callee-saved in the SysV ABI; a scratch reg like r8d would avoid having to save/restore it)
add rsi, 8 ; this insn could be eliminated by loading array+8 into rsi in the first place
; TODO: handle length < 4
ALIGN 16
.loop:
add eax, [ rsi]
add ebx, [4 + rsi]
add rsi, 8
cmp rsi, rdx
jb .loop ; loop while rsi is Below one-past-the-end
; TODO: handle odd-length
add eax, ebx
ret
Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.
If your arrays are always going to be very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Previous Intel CPUs can also issue 4 uops per cycle, but have fewer ALU ports, so the execution units fall behind the frontend. Unrolling to minimize loop overhead would help.)
All these techniques (esp. multiple accumulators) are equally applicable to vector instructions.
segment .rodata ; read-only data
ALIGN 16
array: times 64 dd 1, 2, 3, 4, 5
array_bytes equ $-array
string1 db "result: ",0
segment .text
; TODO: scalar loop until rsi is aligned
; TODO: handle length < 64 bytes
lea rsi, [array + 32]
lea rdx, [rsi - 32 + array_bytes] ; array_length could be a register (or 4*a register, if it's a count).
; lea rdx, [array + array_bytes] ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register. We don't need rdx until later.
movdqu xmm0, [rsi - 32] ; load first vector of 4 dwords
movdqu xmm1, [rsi - 16] ; load 2nd vector
; note the more-efficient loop setup that doesn't need an add rsi, 32.
ALIGN 16
.loop:
paddd xmm0, [ rsi] ; add packed dwords
paddd xmm1, [16 + rsi]
add rsi, 32
cmp rsi, rdx
jb .loop ; loop: 4 fused-domain uops
paddd xmm0, xmm1
phaddd xmm0, xmm0 ; horizontal add: SSSE3 phaddd is simple but not optimal. Better to pshufd/paddd
phaddd xmm0, xmm0
movd eax, xmm0
; TODO: scalar cleanup loop
ret
Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration processes two vectors of four packed dwords = 32 bytes of input data.
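For reference, the pshufd/paddd horizontal sum mentioned in the comments would look something like this (a sketch, SSE2 only):
pshufd xmm1, xmm0, 0xEE   ; bring the high 64 bits down
paddd xmm0, xmm1          ; 4 dwords -> 2
pshufd xmm1, xmm0, 0x55   ; bring element 1 down
paddd xmm0, xmm1          ; 2 dwords -> 1
movd eax, xmm0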
The main loop should run at one iteration per cycle on Haswell, otherwise 1 iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.
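That 4-accumulator version might look like this (same alignment/length caveats; xmm2/xmm3 need to be zeroed or pre-loaded in the setup, and the stride is now 64 bytes):
ALIGN 16
.loop:
paddd xmm0, [ rsi]
paddd xmm1, [16 + rsi]
paddd xmm2, [32 + rsi]
paddd xmm3, [48 + rsi]
add rsi, 64
cmp rsi, rdx
jb .loop
paddd xmm0, xmm1          ; combine the four accumulators...
paddd xmm2, xmm3
paddd xmm0, xmm2          ; ...then horizontal-sum xmm0 as before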
With AVX2, vpaddd ymm1, ymm1, [32 + rsi] gets you double the throughput (if the data is hot in cache; otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0, xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.
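A sketch of that 256b horizontal sum (assuming the vector sum ended up in ymm0):
vextracti128 xmm1, ymm0, 1 ; high 128 bits
vpaddd xmm0, xmm0, xmm1    ; 8 dwords -> 4
vpshufd xmm1, xmm0, 0xEE
vpaddd xmm0, xmm0, xmm1    ; 4 -> 2
vpshufd xmm1, xmm0, 0x55
vpaddd xmm0, xmm0, xmm1    ; 2 -> 1
vmovd eax, xmm0
vzeroupper                 ; avoid AVX-SSE transition penalties before returning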
Upvotes: 3