Reputation: 873
I played on GodBolt to see x86-64 gcc(6.3) compiles the following codes:
typedef __int128_t int128_t;
typedef __uint128_t uint128_t;
uint128_t mul_to_128(uint64_t x, uint64_t y) {
return uint128_t(x)*uint128_t(y);
}
uint128_t mul(uint128_t x, uint128_t y) {
return x*y;
}
uint128_t div(uint128_t x, uint128_t y) {
return x/y;
}
and I got:
mul_to_128(unsigned long, unsigned long):
mov rax, rdi
mul rsi
ret
mul(unsigned __int128, unsigned __int128):
imul rsi, rdx
mov rax, rdi
imul rcx, rdi
mul rdx
add rcx, rsi
add rdx, rcx
ret
div(unsigned __int128, unsigned __int128):
sub rsp, 8
call __udivti3 //what is this???
add rsp, 8
ret
3 questions:
64-bit
uint to 128-bit
then multiply them) are
much simpler than multiplication of 2 128-bit uints(2nd function). Basically, just
1 multiplication. If you multiply 2 maximums of 64-bit uint, it
definitely overflows out of a 64-bit register...How does it produce
128-bit result by just 1 64-bit-64-bit multiplication??? hi
as higher 4 bytes
and lo
as lower 4 bytes), and assemble the result like
(hi1*hi2)<<64 + (hi1*lo2)<<32 + (hi2*lo1)<<32+(lo1*lo2)
. Apparently
I was wrong...because it uses only 3 multiplications (2 of them
are even imul
...signed multiplication???why???). Can anyone tell me
what gcc is thinking? And it is optimal?__udivti3
then pop stack...is __udivti3
something
big?(like table look-up?) and what stuff does gcc try to push before the call?the godbolt link: https://godbolt.org/g/sIIaM3
Upvotes: 2
Views: 873
Reputation: 76410
You're right that multiplying two unsigned 64-bit values can produce a 128-bit result. Funny thing, hardware designers know that, too. <g> So multiplying two 64-bit values produces a 128-bit result by storing the lower half of the result in one 64-bit register and the upper half of the result in another 64-bit register. The compiler-writer knows which registers are used, and when you call mul_to_128
it will look for the results in the appropriate registers.
In the second example, think of the values as a1*2^64 + a0
and b1*2^64 + b0
(that is, split each 128-bit value into two parts, the upper 64 bits and the lower 64 bits). When you multiply those you get a1*b1*2^64*2^64 + a1*b0*2^64 + a0*b1*2^64 + a0*b0
. That's essentially what the assembly code is doing. The parts of the result that overflow 128 bits are ignored.
In the third example,__udivti3
is a function that does the division. It's not simple, so it doesn't get expanded inline.
Upvotes: 10
Reputation: 58772
mul rsi
will produce a 128 bit result in rdx
:rax
, as any instruction set reference will tell you.imul
is used to get a 64 bit result. It works even for unsigned. Again, the instruction set reference says: "The two- and three-operand forms may also be used with unsigned operands because the lower half of the product
is the same regardless if the operands are signed or unsigned." Other than that, yes, basically it's doing the double width equivalent of what you described. Only 3 multiplies, because the result of the 4th would not fit in the output 128 bit anyway.__udivti3
is just a helper function, you can look at its disassembly to see what it's doing.Upvotes: 4