Haramoyo

Reputation: 35

Performance of custom bitwise vs native CPU operations

Hi everyone! I have been trying to write my own big-integer class for an RSA implementation in C++ (for practice purposes only). The only way I can see to implement this well in terms of performance is to use C++'s built-in bitwise operations (&, |, ^), i.e. implementing custom full adders for addition, binary multipliers for multiplication, and so on.
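For concreteness, this is the kind of adder emulation I have in mind (just a rough sketch, not optimized):

```cpp
#include <cstdint>

// Rough sketch of what I mean: 64-bit addition built only from
// bitwise AND/XOR and shifts, repeatedly propagating the carries.
uint64_t my_add(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t carry = a & b;  // positions that generate a carry
        a ^= b;                  // partial sum, ignoring the carries
        b = carry << 1;          // carries shift one bit to the left
    }
    return a;
}
```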
What I am interested in can be formulated as follows: would custom-made emulations of the hardware circuits for integer arithmetic (full adders, multipliers, and so on), built from C++ bitwise operations, necessarily be slower than the native operations? In other words, if I write my own 64-bit unsigned integer class and make it "ideal" in terms of the number of bitwise operations needed to perform addition, multiplication and division, can it match the performance of the built-in unsigned long long?
Can it be made that fast in any programming language at all, or is it impossible to ever make an operation faster than those in the CPU's native instruction set?
Please note that I am not interested in answers about RSA implementations, only in the performance comparison between native and hand-made arithmetic.
Thank you in advance!

Upvotes: 0

Views: 145

Answers (2)

user555045

Reputation: 64913

Software implementations cannot even come close to the arithmetic circuits typically used in hardware. It's not even a question of benchmarking or of results differing between systems (assuming we're talking about hardware that isn't prehistoric); it's a hands-down win for hardware circuits, guaranteed, every time, by a huge factor. A big difference between hardware and software is this: hardware has almost unlimited parallelism, so the number of bit-operations doesn't matter as much; speed depends primarily on the "depth" of the circuit. Software also has parallelism, but it's very limited.

Consider a typical fast multiplier, as used in modern hardware. They're based on some parallel reduction scheme, such as a Dadda multiplier (the actual circuit doesn't necessarily follow Dadda's algorithm to the letter, but it's going to use a similar parallel reduction). Hardware has almost unlimited parallelism to do that parallel reduction with. As a result, 64-bit multiplication takes 3 cycles on many modern machines (not all of them, but for example both the Apple M1 and all current Intel and AMD x64 processors). Granted, in 3 cycles you could squeeze in more than 3 bitwise operations, but it's not even a contest: you cannot implement multiplication in just a handful of bitwise operations.
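For contrast, here's a sketch of the serial shift-and-add multiplication that a software emulation boils down to (and it even cheats by using the native adder for the partial sums):

```cpp
#include <cstdint>

// Serial shift-and-add multiplication: up to 64 dependent iterations,
// each doing a test, an add and two shifts. Hardware reduces all 64
// partial products *in parallel* (Wallace/Dadda style), which is how
// it finishes in 3 cycles.
uint64_t shift_add_mul(uint64_t a, uint64_t b) {
    uint64_t product = 0;
    while (b != 0) {
        if (b & 1)          // lowest bit of b selects a partial product
            product += a;   // note: this even uses the native adder
        a <<= 1;            // next partial product is a shifted copy of a
        b >>= 1;
    }
    return product;
}
```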

Even just addition is already unbeatable. It's already as fast as bitwise operations are; or, perhaps more accurately, bitwise operations are as slow as addition is. The time a bitwise operation takes at the software level has little to do with the latency of the corresponding gate; it's more a property of how the processor was designed in general. By the way, you may also be interested in this other question: Why is addition as fast as bit-wise operations in modern processors?
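To make that concrete, here's a sketch of a Kogge-Stone-style parallel-prefix adder, about the best you can do in software with pure bitwise operations: the carry chain collapses to log2(64) = 6 rounds, but that's still a few dozen dependent operations against a single-cycle native add.

```cpp
#include <cstdint>

// Sketch of a Kogge-Stone parallel-prefix adder in software: the carry
// computation needs only log2(64) = 6 rounds, yet it still costs a few
// dozen dependent bitwise operations. The native ADD takes 1 cycle.
uint64_t ks_add(uint64_t a, uint64_t b) {
    uint64_t g = a & b;  // generate: positions that create a carry
    uint64_t p = a ^ b;  // propagate: positions that pass a carry on
    for (int shift = 1; shift < 64; shift <<= 1) {
        g |= p & (g << shift);  // merge carry info across wider spans
        p &= p << shift;
    }
    return (a ^ b) ^ (g << 1);  // sum = propagate XOR incoming carries
}
```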

Upvotes: 3

Tzig

Reputation: 851

As P Kramer said in his comment, it totally depends on your system, instruction set, and compiler. Modern compilers are pretty good at finding optimizations for your specific CPU when you ask them to, but you can't know from theory alone whether they'll do as well as, or better than, the native instruction.

As usual in this kind of situation, I suggest A/B testing (don't forget to use -march and -mtune if using gcc/clang) to check which implementation is fastest on your machine and by how much; a minimal harness is sketched below.
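Something like this (a minimal sketch; compile each variant with, e.g., g++ -O2 -march=native -mtune=native, and swap the emulated adder for the native + to get the baseline):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Emulated adder under test (same idea as the sketch in the question).
uint64_t my_add(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t carry = a & b;
        a ^= b;
        b = carry << 1;
    }
    return a;
}

int main() {
    uint64_t acc = 1;
    auto start = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < 100000000; ++i)
        acc = my_add(acc, i);       // variant B: acc = acc + i;
    auto elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start);
    // Printing acc keeps the compiler from deleting the loop entirely.
    std::printf("acc=%llu  time=%.3fs\n",
                (unsigned long long)acc, elapsed.count());
}
```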

Upvotes: 1
