Reputation: 11468
How to efficiently compute 2¹²⁸ % n, where n is uint64_t
(and non-zero)?
If I had access to a narrowing division like _udiv128
intrinsic of MSVC I could do:
uint64_t remainder;
_udiv128(1, 0, n, &remainder);
_udiv128(remainder, 0, n, &remainder);
But that's two slow instructions, not to mention there are CPUs like ARM which do not have the narrowing division. Is there a better way?
Related: How to compute 2⁶⁴/n in C?
Upvotes: 7
Views: 324
Reputation: 64904
The number of narrowing divisions can be reduced to 1 with a simple trick:
uint64_t remainder;
_udiv128(-n % n, 0, n, &remainder);
-n % n
still costs a 64-bit division, but a division where the high half of the dividend is zero tends to be more efficient (certainly more efficient than the "full" libdivide_128_div_64_to_64
, using its slow path, which costs two 64-bit divisions plus a bunch of additional stuff). Some processors don't care as much about a non-zero upper half of the dividend though.
This trick works because 264 is definitely greater than n
, so we can subtract n
from it once, then we get 264-n which is equal to -n
computed modulo 264 (arithmetic on uint64_t is done modulo 264). We can't just put -n
into the high part of the dividend because then the quotient becomes too large and the division faults, but a dedicated 128-bit by 64-bit remainder operation (as opposed to a division that also produces a remainder as a by-product) could support that. So -n % n
it is.
The _udiv128
intrinsic could be replaced with a call to libdivide_128_div_64_to_64
to support other platforms.
It seems plausible to me (given the special nature of this case) that there are more tricks, but I don't know them and couldn't find them. A common recommendation that I encountered was to use exponentiation by squaring, but if taken literally that would first take 6 steps just to get to -n % n
(we may as well start with that and then do only one square), and doesn't solve the biggest problem: how to do what _udiv128
does without _udiv128
. Squaring -n % n
and then reducing it mod n
doesn't seem any better to me than multiplying -n % n
by 264 and reducing it mod n
, it just costs an extra multiplication (with 128-bit output) and then leaves us with an equally annoying problem.
Upvotes: 8