Reputation: 1121
I need to deal with bignum calculation (addition and subtraction, but I treat subtraction as equivalent to signed addition) on RISC-V and the situation is a bit complicated. What I gather from half an hour of internet research:
bltu
.As far as I can tell, the branches indeed cover most scenarios rather well, except for one: (signed) bignum addition. Because there, we hit the slowest check path in a hot loop.
I know only a little about ISA design, but why didn't they include an instruction that calculates (a + b) >> 32
(effectively the carry out)? A bit like how the multiplication instruction is split into mul
and mulh
as well. This would allow to do the desired calculation with always two instructions. More powerful micro architectures could then even detect the sequence and only do one addition.
Am I missing some tricks that would make this instruction obsolete (or be equivalent to it)? Does it have any major downsides that I oversee? I did not find a lot of good documentation on this general topic.
Upvotes: 11
Views: 5347
Reputation: 363882
add
/ sltu
gives you sum and carry-out: https://godbolt.org/z/Y7f5dzj1P shows GCC using it for unsigned math: sum=a+b
/ carry = sum<a
. Or for __builtin_uadd_overflow
But the problem with that is lack of ILP: the sltu
can't start until the add
result is ready. That could be solved if you could get carry-out directly from the inputs as you propose; good point. Of course fusion of add/sltu would also solve that problem; perhaps that's what the architects had in mind.
I don't see any CPU-design challenge in creating an instruction that produced a 0
or 1
output according to the carry-out from a 2-input addition. That would be very easy; any way of building a 32 or 64-bit adder to support the add
instruction could easily produce a carry-out signal from the high bit. In fact that's probably what sltu
reads, since it's normal for an integer ALU to use a single binary adder-subtractor, with NOT of one input and a carry-in of 1
to implement subtraction. (Low bit is a full-adder instead of half-adder, otherwise a normal binary adder.)
The other major problem for bignum of more than 2 reg-widths is doing add with carry-in (on ISAs with a carry flag and add-with-carry instruction).
And even worse, getting carry-out from that 3-input addition. (Either part of which could wrap, so it's not possible AFAIK to combine it into one add and compare. This is a common pitfall of pure-C implementations of adc
; comments on that linked answer have working C, but it doesn't compile very efficiently).
Unless there's some trick I'm not aware of, I assume that's what really has people upset with no-FLAGS designs like RISC-V and MIPS for Bignum stuff.
Update: Řrřola's answer shows pure C that clang will compile to a chain of add/adc/adc/adc
for x86-64 or other ISAs with an add-with-carry instruction. For RV64 (https://godbolt.org/z/oaevGs67q), it looks like we get 2x sltu
and 3x add
for each add step that has both carry-in and carry-out. I don't know if this is optimal or not; clang currently only supports up to _BitInt(128)
for RV64 and I didn't check https://gmplib.org/ mpn
hand-written asm.
Upvotes: 10