Leo Lai
Leo Lai

Reputation: 873

Tricks compiler uses to compile basic arithmetic operations of 128-bit integer

I played on GodBolt to see x86-64 gcc(6.3) compiles the following codes:

typedef __int128_t int128_t;
typedef __uint128_t uint128_t;

uint128_t mul_to_128(uint64_t x, uint64_t y) {
  return uint128_t(x)*uint128_t(y);
}
uint128_t mul(uint128_t x, uint128_t y) {
  return x*y;
}
uint128_t div(uint128_t x, uint128_t y) {
  return x/y;
}

and I got:

mul_to_128(unsigned long, unsigned long):
        mov     rax, rdi
        mul     rsi
        ret
mul(unsigned __int128, unsigned __int128):
        imul    rsi, rdx
        mov     rax, rdi
        imul    rcx, rdi
        mul     rdx
        add     rcx, rsi
        add     rdx, rcx
        ret
div(unsigned __int128, unsigned __int128):
        sub     rsp, 8
        call    __udivti3 //what is this???
        add     rsp, 8
        ret

3 questions:

  1. The 1st function(cast 64-bit uint to 128-bit then multiply them) are much simpler than multiplication of 2 128-bit uints(2nd function). Basically, just 1 multiplication. If you multiply 2 maximums of 64-bit uint, it definitely overflows out of a 64-bit register...How does it produce 128-bit result by just 1 64-bit-64-bit multiplication???
  2. I cannot read the second result really well...my guess is to break 64-bit number to 2 32-bit numbers(says, hi as higher 4 bytes and lo as lower 4 bytes), and assemble the result like (hi1*hi2)<<64 + (hi1*lo2)<<32 + (hi2*lo1)<<32+(lo1*lo2). Apparently I was wrong...because it uses only 3 multiplications (2 of them are even imul...signed multiplication???why???). Can anyone tell me what gcc is thinking? And it is optimal?
  3. Cannot even understand the assembly of the division...push stack -> call something called __udivti3 then pop stack...is __udivti3 something big?(like table look-up?) and what stuff does gcc try to push before the call?

the godbolt link: https://godbolt.org/g/sIIaM3

Upvotes: 2

Views: 873

Answers (2)

Pete Becker
Pete Becker

Reputation: 76410

You're right that multiplying two unsigned 64-bit values can produce a 128-bit result. Funny thing, hardware designers know that, too. <g> So multiplying two 64-bit values produces a 128-bit result by storing the lower half of the result in one 64-bit register and the upper half of the result in another 64-bit register. The compiler-writer knows which registers are used, and when you call mul_to_128 it will look for the results in the appropriate registers.

In the second example, think of the values as a1*2^64 + a0 and b1*2^64 + b0 (that is, split each 128-bit value into two parts, the upper 64 bits and the lower 64 bits). When you multiply those you get a1*b1*2^64*2^64 + a1*b0*2^64 + a0*b1*2^64 + a0*b0. That's essentially what the assembly code is doing. The parts of the result that overflow 128 bits are ignored.

In the third example,__udivti3 is a function that does the division. It's not simple, so it doesn't get expanded inline.

Upvotes: 10

Jester
Jester

Reputation: 58772

  1. The mul rsi will produce a 128 bit result in rdx:rax, as any instruction set reference will tell you.
  2. The imul is used to get a 64 bit result. It works even for unsigned. Again, the instruction set reference says: "The two- and three-operand forms may also be used with unsigned operands because the lower half of the product is the same regardless if the operands are signed or unsigned." Other than that, yes, basically it's doing the double width equivalent of what you described. Only 3 multiplies, because the result of the 4th would not fit in the output 128 bit anyway.
  3. __udivti3 is just a helper function, you can look at its disassembly to see what it's doing.

Upvotes: 4

Related Questions