Davide
Davide

Reputation: 2124

How can I multiply two hex 128 bit numbers in assembly

I have two 128 bit numbers in memory in hexadecimal, for example (little endian):

x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

I've to perform the unsigned multiplication between these two numbers so my new number will be:

z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Now, I'm aware that I can move the half x and y number into rax and rbx registers and, for example, do the mul operation, and do the same with the other half. The problem is that by doing so I lose the carry-over and I've no idea how I can avoid that. It's about 4 hours I'm facing this problem and the only solution that can I see is the conversion in binary (and <-> shl,1).

Can you give me some input about this problem?
I think the best solution is to take one byte par time.

Upvotes: 5

Views: 4175

Answers (2)

Peter Cordes
Peter Cordes

Reputation: 365577

As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.

__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }

compiles to (gcc6.2 -O3 on Godbolt)

   imul    rsi, rdx        # a_hi * b_lo
   mov     rax, rdi
   imul    rcx, rdi        # b_hi * a_lo
   mul     rdx             # a_lo * b_lo  widening multiply
   add     rcx, rsi        # add the cross products ...
   add     rdx, rcx        # ... into the high 64 bits.
   ret

Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI, while b is in RCX:RDX. The result is returned in RDX:RAX.

Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.

With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.


Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.


Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple parts in inline-asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).

Upvotes: 10

fuz
fuz

Reputation: 93127

Let μ = 264, then we can decompose your 128 bit numbers a and b into a = a1μ + a2 and b = b1μ + b2. Then we can compute c = ab with 64 · 64 → 128 bit multiplications by first computing partial products:

q1μ + q2 = a2b2
r1μ + r2 = a1b2
s1μ + s2 = a2b1
t1μ + t2 = a1b1

and then accumulating them into a 256 bit result (watch the overflow when doing the additions!):

c = t1μ3 + (t2 + s1 + r1) μ2 + (s2 + r2 + q1) μ + q2

Upvotes: 9

Related Questions