Reputation: 21
that's my solution i'd like to know if it's correct and what is another way of solving it
include 'emu8086.inc'
org 100h
mov CX,n
lea SI,a
mov AL,0
start_sum_a: ;sum all the n elements of the first array
add AL, [SI]
inc SI
loop start_sum_a
mov CX,n
lea SI,b
start_sum_b: ;sum all the n elements of the 2nd array to
add AL, [SI] ;the first sum
inc SI
loop start_sum_b
call print_num ;print the sum
ret
a db 1,3,5,7,9,11,13,15,17,18
b db 0,2,4,6,8,10,12,14,16,19
n dw 10
DEFINE_PRINT_NUM
DEFINE_PRINT_NUM_UNS
Upvotes: 2
Views: 2410
Reputation: 365267
what is another way of solving it
There are always lots of ways to do anything. Some will be more efficient than others, and there are different measurements of efficiency. Different efficiency measurements include code-size (in instruction bytes), or performance for small arrays or large arrays. For a real 8086, code size is usually the determining factor for performance, but for modern x86 CPUs, that's definitely not true. (See the x86 tag wiki for links to docs).
There's no need to store 10 in memory; it should be an equ
constant. IDK if you're supposed to pretend you're writing a function that doesn't take advantage of everything being assemble-time constants. If so, then just be careful how you use the constant. (Like don't write mov di
n + OFFSET a
to have the end-pointer calculated at assemble time.)
You could avoid the slow loop
instruction without increasing the number of instructions in the loop by counting an index down from the end of the array, and using an indexed addressing mode.
Also, since your arrays are adjacent, you could just use one loop that starts at the start of a and ends at the end of b
mov bx, OFFSET a ; no point in using LEA for this
mov si, length_ab - 1 ; index of the last element
xor ax,ax
sum_loop: ; do {
add al, [bx+si]
dec si
jg sum_loop ; } while(si > 0)
jmp print_num ; tailcall optimization: print_num will return directly to our caller
;call print_num
;ret
section .rodata
a: db 1,3,5,7,9,11,13,15,17,18
b: db 0,2,4,6,8,10,12,14,16,19
end_b: ; put a label after the end of b
length_ab equ $ - a ; this is NASM syntax, IDK if emu8086 accepts it
n equ 10
Or take advantage of a
being static: add AL, [a + SI]
. That might be slower on a real 8086, since it puts an extra 2 bytes of code inside the loop that 8086 would have to re-fetch every time. On modern CPUs, saving the mov bx, OFFSET a
instruction is worth it for total code-size. (If you were using the same pointer many times in a loop, having it in a register can make sense though.)
If you know your sum won't overflow a byte, you could do 2 bytes in parallel with add ax, [si]
, and at the end add al, ah
. But that's definitely a special case, and handling the general case (avoiding carry into the next byte) with SWAR techniques isn't going to work well with only 2-byte words. In 16-bit code for 386 or newer, you could use 32-bit registers and mask off the the odd and even bytes separately.
On some superscalar CPUs (like Intel pre-Sandybridge, which can only execute one load per clock cycle), this would be faster, allowing you to add nearly 2 bytes per clock:
xor ax,ax
xor dx,dx
sum_loop: ; do{
mov cx, [si]
add al, cl
add dl, ch
add si, 2
cmp si, end_a
jb sum_loop ; } while (si < end_pointer)
add al, dl
;; mov ah,0 ; if necessary
But on other CPUs, you might be better off just unrolling and using add al, [si]
/ add dl, [si+1]
instead of using a separate load instruction.
On CPUs other than Intel P6 and Sandybridge-family, al
and ah
aren't renamed separately, so add ah, ch
would have a false dependency on the full ax
register. That's why I used dl
instead of ah
.
Note that xor ax,ax
is not dependency-breaking at least on modern Intel CPUs (Haswell/Skylake). It does zero AX, but it doesn't remove the out-of-order execution data dependency on the old value of EAX. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. It might be dep-breaking on Sandybridge and earlier, but definitely prefer xor eax,eax
for zeroing registers.
If you didn't need your code to be compatible with obsolete 8086, you could use SSE2 psadbw
to do the whole thing in only a couple steps.
See Sum reduction of unsigned bytes without overflow, using SSE2 on Intel for an explanation.
Your two arrays are 20 bytes total, so we can hard-code that and handle it as 16 + 4.
pxor xmm0, xmm0 ; xmm0 = 0
movd xmm1, [a+16] ; load last 4 bytes
psadbw xmm1, xmm0 ; sum2 = xmm1 = |b[7]-0| + |b[8]-0] + ...
psadbw xmm0, [a] ; horizontal sum 16 bytes into 2 partial sums in the two 64-bit halves (sum0 and sum1)
; then combine those three 16-bit sums into a single sum.
paddw xmm1, xmm0 ; sum2 += sum0
punpckhqdq xmm0, xmm0 ; get the high half of xmm0
paddw xmm1, xmm0 ; sum2 += sum1
movd eax, xmm1
movzx eax, al ; truncate the sum to 8-bit
jmp print_num
section .rodata
ALIGN 16 ; having a aligned lets us use [a] as a memory operand, or movdqa
a: ...
b: ...
And yes, this will assemble in 16-bit mode (with NASM for example).
Truncating later instead of after every step is fine for addition, because wrap around or carry-out from the low byte is the same thing.
If you couldn't take advantage of a
and b
being adjacent, you could:
movdqu xmm0, [a]
movdqu xmm1, [b]
paddb xmm0, xmm1 ; add packed bytes (no carry across byte boundaries)
psrldq xmm0, 6 ; shift out the high 6 bytes from past the end of a and b
Or even avoid reading past the end of the array:
movq xmm0, [a]
pinsrw xmm0, [a+8], 4
I just realized that since you apparently do want the sum to wrap to 8 bits, you can use paddb
to make this more efficient. For large arrays, you can accumulate with paddb
and do one psadbw
at the end.
movd xmm1, [a+16] ; load last 4 bytes, zeroing the rest of the register
paddb xmm1, [a]
pxor xmm0, xmm0 ; xmm0 = 0
psadbw xmm1, xmm0 ; horizontal sum one vector of byte-sums
movhlps xmm0, xmm1 ; extract high half into a different register
paddw xmm0, xmm1
movd eax, xmm1
movzx eax, al ; truncate the sum to 8-bit
jmp print_num
Upvotes: 2
Reputation: 39366
i'd like to know if it's correct
Your solution looks OK, although I got a hunch that the function print_num defined in "emu8086.inc" will rather expect the number in the AX
register. So better change the mov AL,0
instruction into xor ax,ax
which will clear the whole AX
register and not just its low byte AL
.
what is another way of solving it
You could choose to do the work in a single loop if you setup separate pointers for both arrays.
lea si, a
lea di, b
mov cx, n
xor ax, ax
start_sum:
add al, [si] ;Element of a array
add al, [di] ;Element of b array
inc si
inc di
loop start_sum
But since the starting points of these arrays are a certain distance (10) apart in memory, there's a solution using just 1 pointer:
lea si, a
mov cx, n
xor ax, ax
start_sum:
add al, [si] ;Element of a array
add al, [si + 10] ;Element of b array
inc si
loop start_sum
Finally since these arrays are adjacent in memory, the loop can be simpler still. Just double the number of iterations (One of the suggestions by Peter Cordes):
lea si, a
mov cx, n
shl cx, 1 ;Double the counter
xor ax, ax
start_sum:
add al, [si] ;Element of a array or b array
inc si
loop start_sum
Upvotes: 2