user123

Reputation: 53

32-bit extended multiplication via stack

This is the code I have been using to implement extended multiplication of two 32-bit numbers. Is there a way to implement similar logic as a subroutine, passing the parameters via the stack? Either with the MUL instruction or without it? Can anyone help?

[org 0x0100]
jmp start

multiplicand: dd 123122,0
multiplier:   dd 66341
result:       dd 0,0

start:
initialize:   mov cl,32 

              mov bl,1
checkbit:     test bl,[multiplier]
              jz skip

multiply:     mov ax, [multiplicand]
              add [result],ax
              mov ax, [multiplicand+2]
              adc [result+2], ax
              mov ax, [multiplicand+4]
              adc [result+4], ax
              mov ax, [multiplicand+6] 
              adc [result+6], ax      

skip:         shl bl,1               
              shr word [multiplier+2],1 
              rcr word [multiplier],1 

              shl word [multiplicand],1 
              rcl word [multiplicand+2],1 
              rcl word [multiplicand+4],1 
              rcl word [multiplicand+6],1 
              dec cl
              jnz checkbit

              mov ax, 0x4c00
              int 0x21
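For reference, the shift-and-add logic of the loop above can be sketched in portable C (the function name is made up; it tests the low bit of the shifted multiplier each iteration):

```c
#include <stdint.h>

/* Sketch of the shift-and-add algorithm: 32x32 => 64-bit product.
   A 64-bit accumulator and multiplicand stand in for the multi-word
   add/adc and shl/rcl chains of the asm version. */
static uint64_t shift_add_mul(uint32_t x, uint32_t y)
{
    uint64_t acc = 0;
    uint64_t m = x;              /* multiplicand, shifted left each step */
    for (int i = 0; i < 32; i++) {
        if (y & 1u)              /* current low bit of the multiplier */
            acc += m;            /* the add/adc chain */
        y >>= 1;                 /* the shr/rcr chain */
        m <<= 1;                 /* the shl/rcl chain */
    }
    return acc;
}
```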

Upvotes: 2

Views: 712

Answers (3)

Peter Cordes

Reputation: 364811

As Jester points out, you can do 32x32 => 64-bit multiply using 4x mul instructions, with appropriate add/adc to add the partial products into the 64-bit result. (It's possible to optimize away some of the adc [result+6], 0 in cases where carry can't have propagated that far, if you're not doing a multiply-accumulate into an existing non-zero 64-bit value.)

You need all 4 products of combinations of halves:
low * low, the low * high / high * low cross products, and high * high. (If you only want the low 32 bits of the result, the same width as the inputs, you don't need the high * high part: it would land entirely in the high half.)
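In C, that decomposition can be sketched like this (16-bit halves; `mul32x32` is an illustrative name, not from any of the answers):

```c
#include <stdint.h>

/* 32x32 => 64-bit product from four 16x16 => 32-bit partial products.
   Each partial product lines up at a 16-bit-aligned offset in the result. */
static uint64_t mul32x32(uint32_t x, uint32_t y)
{
    uint32_t xl = x & 0xFFFFu, xh = x >> 16;
    uint32_t yl = y & 0xFFFFu, yh = y >> 16;

    uint64_t ll = (uint64_t)xl * yl;   /* bits  0..31 */
    uint64_t lh = (uint64_t)xl * yh;   /* bits 16..47: cross product */
    uint64_t hl = (uint64_t)xh * yl;   /* bits 16..47: cross product */
    uint64_t hh = (uint64_t)xh * yh;   /* bits 32..63 */

    /* For a low-32-bits-only result, the hh term could be dropped entirely. */
    return ll + ((lh + hl) << 16) + (hh << 32);
}
```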

(On more modern x86-64, see https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf for the same thing using 64-bit chunks, and how it gets more efficient with mulx (leaves CF untouched) and also ADOX/ADCX which become useful for larger products with more than 2 input chunks. e.g. 512-bit x 512-bit using 64-bit chunks aka limbs. The whitepaper has helpful diagrams of how partial products line up with overlap.)
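The generalization to more than two input chunks is ordinary schoolbook multiplication over limbs; a C sketch with 32-bit limbs in little-endian limb order (`limb_mul` is a made-up name):

```c
#include <stdint.h>

/* Schoolbook multiply over 32-bit limbs, little-endian limb order.
   r must have room for 2*n limbs; the 64-bit temporary plays the role
   of the adc carry chain. */
static void limb_mul(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
{
    for (int i = 0; i < 2 * n; i++)
        r[i] = 0;
    for (int i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;    /* low 32 bits stay in place */
            carry = t >> 32;           /* high half propagates upward */
        }
        r[i + n] = (uint32_t)carry;    /* final carry-out of this row */
    }
}
```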


Jester's answer is optimized for readability, storing / adding partial products into the result in memory after every multiply.

This is an attempt to keep them in registers and avoid a bit of redundant loading. (Although some of that may come at the expense of extra machine-code size, which on original 8086 and especially 8088 is costly, except when pre-fetching it in the shadow of a slow mul instruction.) I also experimented with using stosw to store the results and increment the output pointer (ES:DI). I'm assuming a flat / small memory model where DS=ES. Another option would be to store the results over the input args on the stack, if you delay the stores until after the last load of each input.

Using more registers becomes more attractive if you don't need to save/restore them. I decided to let BX be destroyed by this version of the function, and to take a result pointer already in DI. (A register arg). In pure assembly language, it makes sense to use custom calling conventions, especially for relatively small helper functions; obviously one can add more push/pop to save/restore more of the registers this code uses.

Delaying the add/adc work for the 3rd mul until after the final mul seems to be a good optimization.

Optimized version (63 bytes: speed on 8088 / 8086 depends on size)

I also tried to schedule fast instructions (with no memory access) after mul, to empty the prefetch buffer that will fill up during the very slow mul. (Assuming we're tuning for actual 8088, where total cost ~= sum of total bytes stored/loaded, including code fetch, except for slow instructions like mul that take extra long, and give code-fetch a chance to fill up the 4 byte buffer.)

;;; inputs:  result pointer in ES/DS:DI (unmodified on return), DS=ES assumed.
;;;          dword X and Y on the stack.
;;;          (For ease of comparison / adaptation, I left an unused word below them, where the result pointer was passed before)
;;;          (To only ever use ES:DI, use an ES prefix on the [DI] addr modes, or just avoid STOSW and use (DS:) [DI+0..6] addressing.)
;;; outputs: qword [ES:DI] = X * Y
;;; clobbers: AX, CX, DX,  BX.
;;;           SI saved/restored, DI restored.
;;; precondition: DF=0  (cld) assumed as part of the calling convention
;%define bp ebp
;%define di edi
;%define sp esp

multiply_v2:
    push  bp
    mov   bp, sp
    push  si
;%define bp ebp+2      ; saved-EBP is wider than saved-BP

    ;mov   di, [bp+4]       ; result pointer.  Instead, caller passes it

    mov   ax, [bp+6]       ; xl
    mov   si, ax           ; xl
    mov   cx, [bp+10]      ; yl

    MUL   cx               ;; xl * yl
    mov   bx, dx           ; r1   ; eventually store to [di-2 + 2]
    stosw                  ; r0;  ; mov  [di], ax  /  add di, 2

    mov   ax, [bp+8]       ; xh
    MUL   cx               ;; xh * yl
    xor   cx, cx
    add   ax, bx
    stosw                       ; r1 in mem, BX free, DI pointing at result+4 = r2
    adc   cx, dx                ; r2 in CX   (adc dx,0 / mov cx,dx)
    ; carry into r3 is impossible here: 0xffff^2 has a high half of 0xfffe

    mov   ax, [bp+12]      ; yh
    xchg  si, ax             ; save yh; we're done with xl.  (xchg-with-AX saves a byte vs. mov: worth it on 8088)
    MUL   si               ;; xl * yh
    mov   bx, dx             ; save xlyh_hi (r2)
    xchg  si, ax             ; save xlyh_lo (r1), get yh into AX for next mul

    MUL   word [bp+8]      ;; xh * yh  => r3:r2
    add   bx, cx           ; r2: combine partial products from r2:r1 cross-multiplies
    adc   dx, 0            ; r3: which can carry-out, into a real r3 instead of needing to materialize a 0 / 1

    add   [di-4 + 2], si   ; r1  at result+2, from 2nd cross-multiply
    adc   ax, bx           ; r2, *can* carry farther
    adc   dx, 0            ; r3

    stosw                  ; r2
    mov   [di], dx         ; r3 at [result+6] = [di-6 + 6]

    pop   si
    sub   di, 6            ; 3 bytes.  Still pays for itself in size savings of earlier stosw instructions vs. mov [di+4], on 8086/8088

;%define bp ebp
    ; mov sp, bp   ; redundant
    pop   bp
    ret

The commented-out %define lines are from testing in 32-bit mode under Linux, with the following C caller. (It works, including some corner cases like 0xFFFFFFFF squared). The unused low 2 bytes of arg space ("home space" for the DI register arg?) are filled by the top of the return address in 32-bit mode, but push ebp is also 4 bytes, not 2, so offsets would have to change.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

__attribute__((regparm(1)))
extern void multest_32b(uint64_t *result, uint32_t x, uint32_t y);

int main(int argc, char **argv) {
    unsigned x = 0x12345, y = 0xffbcde98;
    if (argc>=3) {
        x = strtoll(argv[1], NULL, 0);
        y = strtoll(argv[2], NULL, 0);
    }
    uint64_t test;
    multest_32b(&test, x,y);
    printf ("%#x * %#x = %#llx.  (test ^ control = %#llx)\n", x, y, test, test ^ (x * (uint64_t)y)  );
}

The 64-bit XOR result is of course 0 when they match, and GCC knows how to do 32x32 => 64-bit multiply with one mul instruction.

I used a wrapper function to adapt the calling convention differences. I hoped I could get away without copying the stack args so I used XMM0 to save/restore EDI, and didn't change that when I realized that wasn't going to work. I also let it clobber the caller's BX, but it seems GCC made code that didn't depend on EBX, even though it's a call-preserved register in the i386 System V calling convention. I probably could have just done mov edi, eax / jmp since this main probably doesn't use EDI either.

;; wrapper function adapter for 32-bit C callers
global multest_32b              ; __attribute__((regparm(1)))
multest_32b:
    movd  xmm0, edi
    mov   edi, eax
    push  dword [esp+8]    ; bleh, can't tail-call if we want to restore (E)DI
    push  dword [esp+8]
    call  multiply_v2
    movd  edi, xmm0
    add   esp, 8
    ret
$ nasm -felf32 mul-ext.asm && gcc -m32 mul-test.c mul-ext.o
$ ./a.out
$ ./a.out 0x7feffeff 0xffeffeff 
0x7feffeff * 0xffeffeff = 0x7fe7ff7ea0210201.  (test ^ control = 0)
$ ./a.out 0xffffffff 0xffffffff 
0xffffffff * 0xffffffff = 0xfffffffe00000001.  (test ^ control = 0)

$ ./a.out 
0x12345 * 0xffbcde98 = 0x122f89eeec6f8.  (test ^ control = 0)
$ while ./a.out $(( $RANDOM * $RANDOM )) $(( $RANDOM * $RANDOM )) ;do :;done | fgrep -v 'control = 0)'
 ... runs until you ^C with no output, i.e. no error

64x64 => 128-bit, in 32-bit mode, using some SSE2

I thought it might be fun to see what it looks like if you do two of the partial products with SSE2 pmuludq. This version is untested. GCC has an unsigned __int128 on 64-bit targets, so the same testing strategy could be used with a wrapper function that adapted from register args to stack args.
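As a C-level reference for what either SSE2 version has to compute, here's the partial-product lineup with 32-bit chunks (a sketch; `mul64x64_128` is an illustrative name, with the two halves of the 128-bit result returned through out-pointers):

```c
#include <stdint.h>

/* 64x64 => 128-bit product from four 32x32 => 64-bit partial products:
   the same r3:r2:r1:r0 lineup the asm accumulates. */
static void mul64x64_128(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    uint64_t xl = (uint32_t)x, xh = x >> 32;
    uint64_t yl = (uint32_t)y, yh = y >> 32;

    uint64_t ll = xl * yl;    /* r1:r0 */
    uint64_t lh = xl * yh;    /* r2:r1 cross product */
    uint64_t hl = xh * yl;    /* r2:r1 cross product */
    uint64_t hh = xh * yh;    /* r3:r2 */

    /* r1 plus both cross-product low halves; carry-out is at most 2 */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    *lo = (mid << 32) | (uint32_t)ll;
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```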

Version 1 of this does the cross products with scalar, the other two with SIMD, using vector store / scalar reload instead of a bunch of shuffling, since we want to touch 3 of the 4 elements with scalar, and the low dword of that store is finished.

Untested, i386 System V calling convention. (Pure stack args, EAX, ECX, EDX and XMM regs call-clobbered).

global multiply_64x64_128_SSE2
multiply_64x64_128_SSE2:             ; (uint64_t x, uint64_t y,  uint128 *result)  stack args
    push  ebx
%define ARGS esp+4
    mov   eax,  [ARGS + 0]    ; Xl
    MUL   dword [ARGS + 12]   ;; Xl * Yh       ; first cross product

     ;; assume aligned args, like Linux's version of i386 System V
    pshufd  xmm0, [ARGS + 0], 0b11_01_10_00    ; [ Yh, Xh, Yl, Xl ]
    pshufd  xmm1, xmm0, 0b1111_0101            ; [ Yh, Yh, Yl, Yl ]
    pmuludq xmm0, xmm1                         ; [   Xh*Yh,   Xl*Yl ]  hi * hi and low x low 64-bit halves where they should be in result.

    mov   ebx, eax              ; r1
    mov   ecx, edx              ; r2
    mov   eax, [ARGS + 4]       ; Xh
    MUL   dword [ARGS + 12]     ;; Xh * Yh    r2:r1
    add   eax, ebx              ; r1
    adc   edx, ecx              ; r2
    mov   ecx, 0                      ; can't xor-zero to avoid partial-register stalls unless we had another free register. setc / movzx is better on P6-family, false dep otherwise.
                                      ; or mov ecx, [ebx+12] / adc ecx,0   now instead of memory-source adc later: good on SnB-family, except store-forwarding needs to be ready sooner.
    setc  cl                    ; r3 carry-out from r2 materialized without reloading SIMD result yet

    mov   ebx, [ARGS+16]        ; uint128 *result
    movdqu  [ebx], xmm0         ; vector store, then accumulate cross products into it.

    add   [ebx+4], eax          ; r1
    adc   edx, [ebx+8]          ; r2
     mov   [ebx+8], edx             ; memory-destination ADC is inefficient on Intel
    adc   ecx, [ebx+12]         ; r3
     mov   [ebx+12], ecx
    pop   ebx
    ret

Alternate SSE2 version: use SIMD for both products that include Yl. Needs an extra free register vs. the earlier version, and reloads the SIMD store earlier in the critical path. But does save 1 scalar load uop.

global multiply_64x64_128_SSE2_v2
multiply_64x64_128_SSE2_v2:             ; (uint64_t x, uint64_t y,  uint128 *result)  stack args
    push  ebx
    push  edi
%define ARGS esp+8
    mov   eax,  [ARGS + 12]    ; Yh
    mov   edi,  eax
    MUL   dword [ARGS + 0]     ;; Xl * Yh     r2:r1 cross product

     ;; assume aligned args, like Linux's version of i386 System V
    movdqu  xmm0, [ARGS + 0]                   ; [ Yh, Yl, Xh, Xl ]
        ;  pshufd  xmm0, [ARGS + 0], 0b11'01'10'00    ; [ Yh, Xh, Yl, Xl ]
    pshufd  xmm1, xmm0, 0b11_01_00_10          ; [ Yh, Xh, Xl, Yl ]      ; do both partial products involving Yl, using only 1 shuffle
    pmuludq xmm0, xmm1                         ; [   Xh*Yl(r2:r1),   Xl*Yl (r1:r0)]   ; low dword fully finished, and one cross product out of place

    mov   ebx, eax              ; r1
    mov   eax, [ARGS + 4]               ; Xh.  mul [mem] micro-fuses on Intel SnB-family, so this is equal to mov eax,edi / mul [mem] only needing 1 free reg.  But we need more later.
    mov   ecx, edx              ; r2
    MUL   edi                   ;; Xh * Yh    r3:r2

     mov   edi, [ARGS+16]        ; uint128 *result
     movdqu  [edi], xmm0         ; vector store, then accumulate partial products into it.

    add   ebx, [edi+8]          ; r1   (from SIMD cross product)
    adc   eax, [edi+12]         ; r2
    adc   edx, 0

    add   [edi+4], ebx          ; r1   (from SIMD low * low)
    adc   eax, ecx              ; r2
     mov   [edi+8], eax             ; memory-destination ADC is inefficient on Intel
    adc   edx, 0                ; r3
     mov   [edi+12], edx

    pop   edi
    pop   ebx
    ret

adc is 2 uops on Intel before Broadwell (or Sandybridge for adc reg,0), so I tried to schedule other instructions after it for better decoder throughput. But either version could be tuned more aggressively at the expense of readability (scattering related operations around more, instead of keeping them grouped).

They're also not really tuned for P6-family where partial-register stalls are a problem.

MULX would really be nice for having two explicit outputs, instead of implicit EDX:EAX, as much as for not touching FLAGS. Keeping an input in the same register and multiplying a couple different things by it into different output registers would save significant mov instructions. I can really see why they designed it that way, instead of having one of the outputs be the implicit operand.

Upvotes: 0

Jester

Reputation: 58772

[org 0x0100]
jmp start

multiplicand: dd 123122
multiplier:   dd 66341
result:       dd 0

start:
    push word [multiplier+2]
    push word [multiplier]
    push word [multiplicand+2]
    push word [multiplicand]
    call multiply
    add sp, 8            ; free arguments
    mov [result], ax     ; expect result in dx:ax
    mov [result+2], dx

    mov ax, 0x4c00
    int 0x21

multiply:
    push bp
    mov bp, sp

    mov ax, [bp+4]
    mul word [bp+8]      ; xl * yl
    mov cx, [bp+4]
    imul cx, [bp+10]     ; xl * yh
    add dx, cx
    mov cx, [bp+6]
    imul cx, [bp+8]      ; xh * yl
    add dx, cx

    mov sp, bp
    pop bp
    ret

It's not clear whether you need a 64-bit result; the above code produces 32 bits.
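The 32-bit-result shortcut is worth seeing in C, too: only three products are needed, and the cross products contribute only their low 16 bits (a sketch; `mul32_lo` is a made-up name):

```c
#include <stdint.h>

/* Low 32 bits of a 32x32 product from three 16x16 partial products:
   xh*yh would land entirely above bit 31, and only the low halves of
   the cross products survive the truncation (like the 16-bit imul). */
static uint32_t mul32_lo(uint32_t x, uint32_t y)
{
    uint32_t xl = x & 0xFFFFu, xh = x >> 16;
    uint32_t yl = y & 0xFFFFu, yh = y >> 16;

    uint32_t r = xl * yl;        /* full 32-bit xl*yl */
    r += (xl * yh) << 16;        /* high half shifts out */
    r += (xh * yl) << 16;
    return r;                    /* == x * y (mod 2^32) */
}
```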

A 64 bit version may look like this:

[org 0x0100]
jmp start

multiplicand: dd 123122
multiplier:   dd 66341
result:       dd 0, 0

start:
    push word [multiplier+2]
    push word [multiplier]
    push word [multiplicand+2]
    push word [multiplicand]
    push result           ; pointer for result
    call multiply
    add sp, 10            ; free arguments

    mov ax, 0x4c00
    int 0x21

multiply:
    push bp
    mov bp, sp
    push bx

    mov bx, [bp+4]       ; result

    mov ax, [bp+6]
    mul word [bp+10]     ; xl * yl
    mov [bx], ax         ; r0
    mov [bx+2], dx       ; r1

    mov ax, [bp+6]
    mul word [bp+12]     ; xl * yh
    add [bx+2], ax       ; r1
    adc dx, 0
    mov [bx+4], dx       ; r2

    mov ax, [bp+8]
    mul word [bp+10]     ; xh * yl
    add [bx+2], ax
    adc [bx+4], dx       ; carry into the highest limb is possible here
    mov dx, 0            ; inefficient but doesn't affect FLAGS
    adc dx, 0            ; setc dl
    mov [bx+6], dx       ; r3

    mov ax, [bp+8]
    mul word [bp+12]     ; xh * yh
    add [bx+4], ax       ; r2
    adc [bx+6], dx       ; r3

    mov ax, bx           ; return result
    pop bx
    mov sp, bp
    pop bp
    ret

(More efficient might be to keep the results of both of the last two multiplies in registers before adding, so we can avoid storing and then doing a memory-destination adc.)

Disclaimer: I have just backported the usual 32-bit convention, whereby an extra hidden argument points to a caller-reserved location for the result, and that pointer is also returned. This code works, but I have no idea whether 16-bit compilers really used this convention.

Upvotes: 2

rkhb

Reputation: 14409

I guess your issue is that SP can't be used in 16-bit addressing modes, e.g. [sp + 4]. You can use BP instead. In your own assembly function you are free to choose how to pass arguments and the result. I'll show you a way to pass arguments on the stack and get the result back on the stack:

BITS 16
ORG 0x0100

jmp start

multiplicand: dd 123122,0                   ; 0102 0x0001E0F2 -> 0x00000000
                                            ; 0106 0x00000000 -> 0x0001E0F2
multiplier:   dd 66341                      ; 010A 0x00010325 -> 0x00000000
result:       dd 0,0                        ; 010E 0x00000000 -> 0x0023B1F6
                                            ; 0112 0x00000000 -> 0x00000000

start:
            push word [multiplicand + 6]    ; bp + 22
            push word [multiplicand + 4]    ; bp + 20
            push word [multiplicand + 2]    ; bp + 18
            push word [multiplicand + 0]    ; bp + 16

            push word [multiplier + 2]      ; bp + 14
            push word [multiplier + 0]      ; bp + 12

            push word [result + 6]          ; bp + 10
            push word [result + 4]          ; bp + 8
            push word [result + 2]          ; bp + 6
            push word [result + 0]          ; bp + 4

            call sub_mul

            pop word [result + 0]           ; Pop stack into `result`
            pop word [result + 2]
            pop word [result + 4]
            pop word [result + 6]
            add sp, 12                      ; Clean up the rest of the stack

            mov ax, 0x4c00
            int 0x21

sub_mul:
            push bp                         ; Prolog
            mov bp, sp

initialize:   mov cl,32

              mov bl,1
checkbit:     test bl,[bp + 12]
              jz skip

multiply:     mov ax, [bp + 16]
              add [bp + 4],ax
              mov ax, [bp + 18]
              adc [bp + 6], ax
              mov ax, [bp + 20]
              adc [bp + 8], ax
              mov ax, [bp + 22]
              adc [bp + 10], ax

skip:         shl bl,1
              shr word [bp + 14],1
              rcr word [bp + 12],1

              shl word [bp + 16],1
              rcl word [bp + 18],1
              rcl word [bp + 20],1
              rcl word [bp + 22],1
              dec cl
              jnz checkbit

        leave                           ; Epilog
        ret

Upvotes: 1
