Reputation: 31
I'm writing C code for a research project in number theory that requires a lot of operations in modular arithmetic, with many different moduli. To put it simply: I need to compute (a * b) % n
many, many times.
The code is meant to run on a PC with 64-bit words, and all the moduli are known to be less than 2^64, so all the operands are represented as unsigned 64-bit integers.
My question is: would using Montgomery modular multiplication (which uses only additions and multiplications) instead of the C modulo operator % (which translates to a % n = a - n*(a / n) and therefore also uses a division) result in faster execution?
Intuitively, I would say the answer is no, because (word-size) divisions on a PC are not that much more computationally expensive than (word-size) multiplications, so Montgomery reduction would actually cause an overhead.
Thanks for any suggestions.
Update: On the one hand, according to Paul Ogilvie (see his comment below), (a * b) % n requires 1 multiplication and 1 division. On the other hand, Montgomery multiplication requires 3 multiplications (ignoring the binary shifts, and ignoring the operations needed to convert the operands to, and back from, their Montgomery representations, since those are done only once per modulus n). So it would seem that Montgomery beats % as soon as a multiplication runs more than twice as fast as a division...
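To make the 3-multiplication count above concrete, here is a minimal sketch of Montgomery multiplication in C, with R = 2^64 and the GCC/Clang unsigned __int128 extension. All function names are illustrative, not from any library; it assumes n is odd and, to keep the 128-bit intermediate sum from overflowing, n < 2^63.

```c
#include <stdint.h>

/* n' = -n^{-1} mod 2^64, via Newton/Hensel iteration (n must be odd). */
static uint64_t mont_ninv(uint64_t n) {
    uint64_t x = n;            /* n*n == 1 (mod 8), so 3 bits are correct */
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;        /* each step doubles the correct bits */
    return (uint64_t)(0 - x);  /* negate: we want -n^{-1} mod 2^64 */
}

/* REDC: given T < n * 2^64, return T * 2^{-64} mod n. */
static uint64_t mont_redc(unsigned __int128 T, uint64_t n, uint64_t ninv) {
    uint64_t m = (uint64_t)T * ninv;                        /* mult 1 */
    uint64_t t = (uint64_t)((T + (unsigned __int128)m * n) >> 64); /* mult 2 */
    return t >= n ? t - n : t;
}

/* Multiply a and b, both already in Montgomery form; result is too. */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t n, uint64_t ninv) {
    return mont_redc((unsigned __int128)a * b, n, ninv);    /* mult 3 */
}

/* Convert into Montgomery form: a * 2^64 mod n. The two % here are the
   once-per-modulus setup cost mentioned in the update. */
static uint64_t mont_to(uint64_t a, uint64_t n, uint64_t ninv) {
    uint64_t r = (uint64_t)(0 - n) % n;                     /* 2^64 mod n */
    uint64_t r2 = (uint64_t)(((unsigned __int128)r * r) % n);
    return mont_mul(a, r2, n, ninv);
}

/* Convert out of Montgomery form: a * 2^{-64} mod n. */
static uint64_t mont_from(uint64_t a, uint64_t n, uint64_t ninv) {
    return mont_redc(a, n, ninv);
}
```

The three multiplications counted in the update are the ones marked in mont_redc and mont_mul; the conversions in mont_to happen once per modulus.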
Upvotes: 3
Views: 452
Reputation: 570
From https://www.agner.org/optimize/instruction_tables.pdf:
64-bit multiplication on Skylake: 3 or 4 cycles (pipelined)
64-bit division on Skylake: 42 to 95 cycles (not pipelined)
Upvotes: 0
Reputation: 64904
all the moduli are known to be less than 2^64, so all the operands are implemented by unsigned 64 bits integers.
However, a * b is a 128-bit product, which complicates the story. div takes a 128-bit dividend, and since a, b < n we have a * b / n < n, so the quotient fits in 64 bits and the division cannot overflow (an overflow would imply out-of-range inputs). That makes it trivial to write in x64 assembly:
; compute (A * B) % N
; A: rax
; B: rdx
; N: rcx
; result: rdx
mul rdx
div rcx
In C the above is impossible to write directly, except with special constructs such as __uint128_t or the _mul128 and _div128 intrinsics. However you get that code to appear, this form of div is the slowest possible one; look for "DIV r64 128/64b (full)" in, for example, a Haswell instruction timing dump: nearly a hundred cycles. On pre-Ice-Lake CPUs this is basically worse than anything else you can do, short of implementing bit-by-bit division yourself. Ice Lake is different and finally has decent integer division, at 18 cycles (add 4 for the initial mul for the overall modmul). That is still not fast, but at least it is not an order of magnitude off the mark, and perhaps worth considering (including buying the new hardware) because:
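For reference, a hedged sketch of what the __uint128_t route looks like in C (GCC/Clang extension; the function name is illustrative). Note that because the compiler cannot prove the quotient fits in 64 bits, it may emit a call to a 128-bit division helper rather than the single div instruction shown above:

```c
#include <stdint.h>

/* (a * b) % n for a, b < n < 2^64, via a full 128-bit product.
   Correct for all such inputs, but likely compiled to a library
   call rather than one mul + one div. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)(((unsigned __int128)a * b) % n);
}
```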
with many different moduli
This can break everything, depending on how many is many. Montgomery multiplication requires finding a peculiar modular inverse, modulo a power of two, so you can at least use Hensel lifting instead of the extended Euclidean algorithm, but it still costs a dozen multiplications plus some extra work, all serial. Similarly, Barrett reduction requires finding a fixed-point reciprocal approximation, which is a fancy way of saying it requires a division up front. With too many different moduli, these tricks are useless because the per-modulus setup cost is never amortized.
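The Hensel-lifting step mentioned above can be sketched like this (illustrative name, not a library function): the inverse of an odd n modulo 2^64 in ten serial multiplications, with no division at all.

```c
#include <stdint.h>

/* n^{-1} mod 2^64 for odd n, by Newton/Hensel iteration. */
static uint64_t inv64(uint64_t n) {
    uint64_t x = n;        /* n*n == 1 (mod 8): 3 correct bits to start */
    x *= 2 - n * x;        /* 6 bits */
    x *= 2 - n * x;        /* 12 bits */
    x *= 2 - n * x;        /* 24 bits */
    x *= 2 - n * x;        /* 48 bits */
    x *= 2 - n * x;        /* 96 >= 64 bits */
    return x;              /* x * n == 1 (mod 2^64) */
}
```

Each line doubles the number of correct low bits, but every multiplication depends on the previous one, which is the "all serial" cost referred to above.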
Upvotes: 0
Reputation: 7973
Your intuition is incorrect. Division is many times slower than multiplication, whether for integers or floating-point numbers. See this excellent answer to a similar question. The exact difference in speed depends on which CPU you are running on, whether the code can be vectorized, and even on what the rest of the code is doing at the same time.
If you divide by an integer constant, for example if you know n at compile time, then the compiler can transform the division into a sequence of multiplications and shifts, perhaps even doing exactly the same as Montgomery modular multiplication. If n is not known at compile time, then it is probably worthwhile to implement Montgomery modular multiplication.
However, the best answer you will get is by implementing both versions of your code and benchmarking them.
Upvotes: 3