Reputation: 53
This is the code I have been using to implement extended multiplication of two 32-bit numbers. Is there a way to implement the same logic as a subroutine, passing the parameters on the stack? Either with the MUL instruction or without it? Can anyone help?
[org 0x0100]
jmp start
multiplicand: dd 123122,0
multiplier: dd 66341
result: dd 0,0
start:
initialize: mov cl,32
mov bl,1
checkbit: test bl,[multiplier]
jz skip
multiply: mov ax, [multiplicand]
add [result],ax
mov ax, [multiplicand+2]
adc [result+2], ax
mov ax, [multiplicand+4]
adc [result+4], ax
mov ax, [multiplicand+6]
adc [result+6], ax
skip: shl bl,1
shr word [multiplier+2],1
rcr word [multiplier],1
shl word [multiplicand],1
rcl word [multiplicand+2],1
rcl word [multiplicand+4],1
rcl word [multiplicand+6],1
dec cl
jnz checkbit
mov ax, 0x4c00
int 0x21
Upvotes: 2
Views: 712
Reputation: 364811
As Jester points out, you can do a 32x32 => 64-bit multiply using 4x mul instructions, with appropriate add/adc to add the partial products into the 64-bit result. (It's possible to optimize away some of the adc [result+6], 0 in cases where carry can't have propagated that far, if you're not doing a multiply-accumulate into an existing non-zero 64-bit value.)
You need all 4 products of combinations of halves: low * low, the low * high / high * low cross products, and high * high.
(If you only want the low 32 bits of the result, same width as the inputs, you don't need the high * high part: that would be entirely in the high half.)
(On more modern x86-64, see https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf for the same thing using 64-bit chunks, and how it gets more efficient with mulx (which leaves CF untouched) and also ADOX/ADCX, which become useful for larger products with more than 2 input chunks, e.g. 512-bit x 512-bit using 64-bit chunks aka limbs. The whitepaper has helpful diagrams of how partial products line up with overlap.)
Jester's answer is optimized for readability, storing / adding partial products into the result in memory after every multiply.
This is an attempt to keep them in registers, avoiding a bit of redundant loading. (Although some of that may come at the expense of extra machine-code size, which on original 8086 and especially 8088 is costly, except when pre-fetching it in the shadow of a slow mul instruction.) I also experimented with using stosw to store the results and increment the output pointer (ES:DI). I'm assuming a flat / small memory model where DS=ES. Another option would be to store the results over the input args on the stack, if you delay stores until after the last load of each input.
Using more registers becomes more attractive if you don't need to save/restore them. I decided to let BX be destroyed by this version of the function, and to take a result pointer already in DI (a register arg). In pure assembly language, it makes sense to use custom calling conventions, especially for relatively small helper functions; obviously one can add more push/pop to save/restore more of the registers this code uses.
Delaying the add/adc work for the 3rd mul until after the final mul seems to be a good optimization.
I also tried to schedule fast instructions (with no memory access) after mul, to empty the prefetch buffer that will fill up during the very slow mul. (Assuming we're tuning for actual 8088, where total cost ~= sum of total bytes stored/loaded, including code fetch, except for slow instructions like mul that take extra long and give code-fetch a chance to fill up the 4-byte buffer.)
;;; inputs: result pointer in ES/DS:DI (unmodified on return), DS=ES assumed.
;;; dword X and Y on the stack.
;;; (For ease of comparison / adaptation, I left an unused word below them, where the result pointer was passed before)
;;to only ever use ES:DI, use an ES prefix on the [DI] addr modes, or just avoid STOSW and use (DS:) [DI+0..6] addressing.
;;; outputs: qword [ES:DI] = X * Y
;;; clobbers: AX, CX, DX, BX.
;;; SI saved/restored, DI restored.
;;; precondition: DF=0 (cld) assumed as part of the calling convention
;%define bp ebp
;%define di edi
;%define sp esp
multiply_v2:
push bp
mov bp, sp
push si
;%define bp ebp+2 ; saved-EBP is wider than saved-BP
;mov di, [bp+4] ; result pointer. Instead, caller passes it
mov ax, [bp+6] ; xl
mov si, ax ; xl
mov cx, [bp+10] ; yl
MUL cx ;; xl * yl
mov bx, dx ; r1 ; eventually store to [di-2 + 2]
stosw ; r0; ; mov [di], ax / add di, 2
mov ax, [bp+8] ; xh
MUL cx ;; xh * yl
xor cx, cx
add ax, bx
stosw ; r1 in mem, BX free, DI pointing at result+4 = r2
adc cx, dx ; r2 in CX (adc dx,0 / mov cx,dx)
; carry into r3 is impossible here: 0xffff^2 has a high half of 0xfffe
mov ax, [bp+12] ; yh
xchg si, ax ; save yh; we're done with xl. (xchg-with-AX saves a byte vs. mov: worth it on 8088)
MUL si ;; xl * yh
mov bx, dx ; save xlyh_hi (r2)
xchg si, ax ; save xlyh_lo (r1), get yh into AX for next mul
MUL word [bp+8] ;; xh * yh => r3:r2
add bx, cx ; r2: combine partial products from r2:r1 cross-multiplies
adc dx, 0 ; r3: which can carry-out, into a real r3 instead of needing to materialize a 0 / 1
add [di-4 + 2], si ; r1 at result+2, from 2nd cross-multiply
adc ax, bx ; r2, *can* carry farther
adc dx, 0 ; r3
stosw ; r2
mov [di], dx ; r3 at [result+6] = [di-6 + 6]
pop si
sub di, 6 ; 3 bytes. Still pays for itself in size savings of earlier stosw instructions vs. mov [di+4], on 8086/8088
;%define bp ebp
; mov sp, bp ; redundant
pop bp
ret
The commented-out %define lines are from testing in 32-bit mode under Linux, with the following C caller. (It works, including some corner cases like 0xFFFFFFFF squared.) The unused low 2 bytes of arg space ("home space" for the DI register arg?) are filled by the top of the return address in 32-bit mode, but push ebp is also 4 bytes, not 2, so offsets would have to change.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

__attribute__((regparm(1)))
extern void multest_32b(uint64_t *result, uint32_t x, uint32_t y);

int main(int argc, char **argv) {
    unsigned x = 0x12345, y = 0xffbcde98;
    if (argc >= 3) {
        x = strtoll(argv[1], NULL, 0);
        y = strtoll(argv[2], NULL, 0);
    }
    uint64_t test;
    multest_32b(&test, x, y);
    printf("%#x * %#x = %#llx. (test ^ control = %#llx)\n",
           x, y, test, test ^ (x * (uint64_t)y));
}
The 64-bit XOR result is of course 0 when they match, and GCC knows how to do a 32x32 => 64-bit multiply with one mul instruction.
I used a wrapper function to adapt the calling-convention differences. I hoped I could get away without copying the stack args, so I used XMM0 to save/restore EDI, and didn't change that when I realized that wasn't going to work. I also let it clobber the caller's BX, but it seems GCC made code that didn't depend on EBX, even though it's a call-preserved register in the i386 System V calling convention. I probably could have just done mov edi, eax / jmp since this main probably doesn't use EDI either.
;; wrapper function adapter for 32-bit C callers
global multest_32b ; __attribute__((regparm(1)))
multest_32b:
movd xmm0, edi
mov edi, eax
push dword [esp+8] ; bleh, can't tail-call if we want to restore (E)DI
push dword [esp+8]
call multiply_v2
movd edi, xmm0
add esp, 8
ret
$ nasm -felf32 mul-ext.asm && gcc -m32 mul-test.c mul-ext.o
$ ./a.out
$ ./a.out 0x7feffeff 0xffeffeff
0x7feffeff * 0xffeffeff = 0x7fe7ff7ea0210201. (test ^ control = 0)
$ ./a.out 0xffffffff 0xffffffff
0xffffffff * 0xffffffff = 0xfffffffe00000001. (test ^ control = 0)
$ ./a.out
0x12345 * 0xffbcde98 = 0x122f89eeec6f8. (test ^ control = 0)
$ while ./a.out $(( $RANDOM * $RANDOM )) $(( $RANDOM * $RANDOM )) ;do :;done | fgrep -v 'control = 0)'
... runs until you ^C with no output, i.e. no error
I thought it might be fun to see what it looks like if you do two of the partial products with SSE2 pmuludq. This version is untested. GCC has an unsigned __int128 on 64-bit targets, so the same testing strategy could be used with a wrapper function that adapted from register args to stack args.
Version 1 of this does the cross products with scalar, the other two with SIMD, using vector store / scalar reload instead of a bunch of shuffling, since we want to touch 3 of the 4 elements with scalar, and the low dword of that store is finished.
Untested, i386 System V calling convention. (Pure stack args, EAX, ECX, EDX and XMM regs call-clobbered).
global multiply_64x64_128_SSE2
multiply_64x64_128_SSE2: ; (uint64_t x, uint64_t y, uint128 *result) stack args
push ebx
%define ARGS esp+4
mov eax, [ARGS + 0] ; Xl
MUL dword [ARGS + 12] ;; Xl * Yh ; first cross product
;; assume aligned args, like Linux's version of i386 System V
pshufd xmm0, [ARGS + 0], 0b11_01_10_00 ; [ Yh, Xh, Yl, Xl ]
pshufd xmm1, xmm0, 0b1111_0101 ; [ Yh, Yh, Yl, Yl ]
pmuludq xmm0, xmm1 ; [ Xh*Yh, Xl*Yl ] hi * hi and low x low 64-bit halves where they should be in result.
mov ebx, eax ; r1
mov ecx, edx ; r2
mov eax, [ARGS + 4] ; Xh
MUL dword [ARGS + 8] ;; Xh * Yl r2:r1 ; second cross product (the SIMD half already covers Xh*Yh)
add eax, ebx ; r1
adc edx, ecx ; r2
mov ecx, 0 ; can't xor-zero to avoid partial-register stalls unless we had another free register. setc / movzx is better on P6-family, false dep otherwise.
; or mov ecx, [ebx+12] / adc ecx,0 now instead of memory-source adc later: good on SnB-family, except store-forwarding needs to be ready sooner.
setc cl ; r3 carry-out from r2 materialized without reloading SIMD result yet
mov ebx, [ARGS+16] ; uint128 *result
movdqu [ebx], xmm0 ; vector store, then accumulate cross products into it.
add [ebx+4], eax ; r1
adc edx, [ebx+8] ; r2
mov [ebx+8], edx ; memory-destination ADC is inefficient on Intel
adc ecx, [ebx+12] ; r3
mov [ebx+12], ecx
pop ebx
ret
Alternate SSE2 version: use SIMD for both products that include Yl. Needs an extra free register vs. the earlier version, and reloads the SIMD store earlier in the critical path. But does save 1 scalar load uop.
global multiply_64x64_128_SSE2_v2
multiply_64x64_128_SSE2_v2: ; (uint64_t x, uint64_t y, uint128 *result) stack args
push ebx
push edi
%define ARGS esp+8
mov eax, [ARGS + 12] ; Yh
mov edi, eax
MUL dword [ARGS + 0] ;; Xl * Yh r2:r1 cross product
;; assume aligned args, like Linux's version of i386 System V
movdqu xmm0, [ARGS + 0] ; [ Yh, Yl, Xh, Xl ]
; pshufd xmm0, [ARGS + 0], 0b11'01'10'00 ; [ Yh, Xh, Yl, Xl ]
pshufd xmm1, xmm0, 0b11_01_00_10 ; [ Yh, Xh, Xl, Yl ] ; do both partial products involving Yl, using only 1 shuffle
pmuludq xmm0, xmm1 ; [ Xh*Yl(r2:r1), Xl*Yl (r1:r0)] ; low dword fully finished, and one cross product out of place
mov ebx, eax ; r1
mov eax, [ARGS + 4] ; Xh. mul [mem] micro-fuses on Intel SnB-family, so this is equal to mov eax,edi / mul [mem] only needing 1 free reg. But we need more later.
mov ecx, edx ; r2
MUL edi ;; Xh * Yh r3:r2
mov edi, [ARGS+16] ; uint128 *result
movdqu [edi], xmm0 ; vector store, then accumulate partial products into it.
add ebx, [edi+8] ; r1 (from SIMD cross product)
adc eax, [edi+12] ; r2
adc edx, 0
add [edi+4], ebx ; r1 (from SIMD low * low)
adc eax, ecx ; r2
mov [edi+8], eax ; memory-destination ADC is inefficient on Intel
adc edx, 0 ; r3
mov [edi+12], edx
pop edi
pop ebx
ret
adc is 2 uops on Intel before Broadwell (or before Sandybridge for adc reg,0), so I tried to schedule other instructions after it for better decoder throughput. But either version could be tuned more aggressively at the expense of readability (scattering related operations around more, instead of keeping them grouped).
They're also not really tuned for P6-family, where partial-register stalls are a problem.
MULX would really be nice for having two explicit outputs, instead of implicit EDX:EAX, as much as for not touching FLAGS. Keeping an input in the same register and multiplying a couple of different things by it into different output registers would save significant mov instructions. I can really see why they designed it that way, instead of having one of the outputs be the implicit operand.
Upvotes: 0
Reputation: 58772
[org 0x0100]
jmp start
multiplicand: dd 123122
multiplier: dd 66341
result: dd 0
start:
push word [multiplier+2]
push word [multiplier]
push word [multiplicand+2]
push word [multiplicand]
call multiply
add sp, 8 ; free arguments
mov [result], ax ; expect result in dx:ax
mov [result+2], dx
mov ax, 0x4c00
int 0x21
multiply:
push bp
mov bp, sp
mov ax, [bp+4]
mul word [bp+8] ; xl * yl
mov cx, [bp+4]
imul cx, [bp+10] ; xl * yh
add dx, cx
mov cx, [bp+6]
imul cx, [bp+8] ; xh * yl
add dx, cx
mov sp, bp
pop bp
ret
It's not clear whether you need a 64-bit result; the above code produces 32 bits.
A 64-bit version may look like this:
[org 0x0100]
jmp start
multiplicand: dd 123122
multiplier: dd 66341
result: dd 0, 0
start:
push word [multiplier+2]
push word [multiplier]
push word [multiplicand+2]
push word [multiplicand]
push result ; pointer for result
call multiply
add sp, 10 ; free arguments
mov ax, 0x4c00
int 0x21
multiply:
push bp
mov bp, sp
push bx
mov bx, [bp+4] ; result
mov ax, [bp+6]
mul word [bp+10] ; xl * yl
mov [bx], ax ; r0
mov [bx+2], dx ; r1
mov ax, [bp+6]
mul word [bp+12] ; xl * yh
add [bx+2], ax ; r1
adc dx, 0
mov [bx+4], dx ; r2
mov ax, [bp+8]
mul word [bp+10] ; xh * yl
add [bx+2], ax
adc [bx+4], dx ; carry into the highest limb is possible here
mov dx, 0 ; inefficient but doesn't affect FLAGS
adc dx, 0 ; setc dl
mov [bx+6], dx ; r3
mov ax, [bp+8]
mul word [bp+12] ; xh * yh
add [bx+4], ax ; r2
adc [bx+6], dx ; r3
mov ax, bx ; return result
pop bx
mov sp, bp
pop bp
ret
(More efficient might be to keep the results of both of the last two multiplies in registers before adding, so we can avoid storing and then doing a memory-destination adc.)
Disclaimer: I have just backported the usual 32-bit convention, whereby an extra hidden argument points to a caller-reserved location for the result, and that pointer is also returned. This code works, but I have no idea whether 16-bit compilers really used this convention.
Upvotes: 2
Reputation: 14409
I guess your issue is that SP can't be used in a 16-bit addressing mode, e.g. [sp + 4] is not valid; you can use BP instead. In your own assembly function you are free to choose how to pass arguments and the result. I'll show you a way to pass the arguments on the stack and get the result back on the stack:
BITS 16
ORG 0x0100
jmp start
multiplicand: dd 123122,0 ; 0102 0x0001E0F2 -> 0x00000000
; 0106 0x00000000 -> 0x0001E0F2
multiplier: dd 66341 ; 010A 0x00010325 -> 0x00000000
result: dd 0,0 ; 010E 0x00000000 -> 0x0023B1F6
; 0112 0x00000000 -> 0x00000000
start:
push word [multiplicand + 6] ; bp + 22
push word [multiplicand + 4] ; bp + 20
push word [multiplicand + 2] ; bp + 18
push word [multiplicand + 0] ; bp + 16
push word [multiplier + 2] ; bp + 14
push word [multiplier + 0] ; bp + 12
push word [result + 6] ; bp + 10
push word [result + 4] ; bp + 8
push word [result + 2] ; bp + 6
push word [result + 0] ; bp + 4
call sub_mul
pop word [result + 0] ; Pop stack into `result`
pop word [result + 2]
pop word [result + 4]
pop word [result + 6]
add sp, 12 ; Clean up the rest of the stack
mov ax, 0x4c00
int 0x21
sub_mul:
push bp ; Prolog
mov bp, sp
initialize: mov cl,32
mov bl,1
checkbit: test bl,[bp + 12]
jz skip
multiply: mov ax, [bp + 16]
add [bp + 4],ax
mov ax, [bp + 18]
adc [bp + 6], ax
mov ax, [bp + 20]
adc [bp + 8], ax
mov ax, [bp + 22]
adc [bp + 10], ax
skip: shl bl,1
shr word [bp + 14],1
rcr word [bp + 12],1
shl word [bp + 16],1
rcl word [bp + 18],1
rcl word [bp + 20],1
rcl word [bp + 22],1
dec cl
jnz checkbit
leave ; Epilog
ret
Upvotes: 1