Reputation:
I want to learn how to reason about the speed of assembly. I implemented a few different ways to reverse the bits of a byte. One uses an algorithm:
rev_u8_a:
movzx eax, dil
mov ecx, 2149582850            # 0x80200802
imul rcx, rax
movabs rdx, 36578664720        # 0x884422110
and rdx, rcx
movabs rax, 4311810305         # 0x101010101
imul rax, rdx
shr rax, 32
ret
and the other uses a lookup table:
rev_u8_d:
movzx eax, dil
lea rcx, [rip + .L__unnamed_1]
movzx eax, byte ptr [rax + rcx]
ret
The first is 8 instructions and the second is 3, but the first is register-only while the second accesses memory. How do I go about reasoning about which might be faster? Which is faster in these two particular pieces of code?
Upvotes: 0
Views: 105
Reputation: 50816
It is far from easy to say which one is faster, because modern processors are insanely complex, especially x86-64 ones.
First of all, processors execute instructions in a pipelined way so as to overlap operations like fetching an instruction, decoding it and executing it. Some instructions take more cycles to complete than others: for example, an imul rcx, rax takes 3 cycles on Skylake while an and rdx, rcx takes 1 cycle.
It gets more complex when several instructions are executed. Indeed, modern processors can execute multiple instructions in parallel during a given cycle. To do that, they track the dependencies between instructions. Some instructions have a high latency but a new one can be issued every cycle, while others cannot be pipelined so well. For example, 3 imul instructions operating on different registers take about 5 cycles on Skylake to complete, while a single imul takes 3 cycles. Depending on how the code is used, it can be more important to reduce its latency or its number of instructions.
In fact, this is even more complex since x86-64 instructions are split into micro-operations, aka uops. Some instructions can be fused (macro-fusion) so as to generate 1 uop, and a single instruction can generate one or several uops. An entire part of the processor is dedicated to fetching instructions, decoding them, converting them into uops, etc.: the front-end. Another part is responsible for executing the instructions and scheduling them on many execution units: the back-end. On most processors the execution units are not all the same, so scheduling instructions is pretty complex, especially since the processor should focus on the critical path so as to execute the instructions in a minimum number of cycles. In practice, uops are dispatched to several ports and each port can only execute a restricted set of uop kinds.
There are many other things to consider: registers can be renamed so as to avoid/break false dependencies and increase the amount of parallelism, instructions are executed out of order so as to increase parallelism and reduce stalls, memory accesses and branches are predicted (see for example speculative execution), micro-operations are cached, multiple threads can run simultaneously on the same core and share resources (see SMT), uops can also be fused (micro-fusion), etc. All this just scratches the surface of how modern processors work: the complexity of a modern x86-64 processor is comparable to that of a huge human city.
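To make this concrete for your two functions, here is a minimal C sketch of what they compute (not your original source: the constants are just the hex forms of the ones in your assembly, and the lookup table is rebuilt at startup here rather than being the constant at .L__unnamed_1). The point is the dependency structure: in rev_u8_a every step feeds the next, so its latency is roughly the sum of the instruction latencies, while rev_u8_d is essentially a single load whose latency depends on where the table currently sits in the memory hierarchy.

#include <stdint.h>
#include <stdio.h>

/* Multiply/mask bit reversal: a serial chain mul -> and -> mul -> shift,
   so roughly 3 + 1 + 3 + 1 cycles of latency on a Skylake-like core. */
static uint8_t rev_u8_a(uint8_t b) {
    uint64_t x = (uint64_t)b * 0x80200802ULL;  /* make shifted copies of the byte  */
    x &= 0x0884422110ULL;                      /* keep each bit once, spread apart */
    x *= 0x0101010101ULL;                      /* sum the copies: the reversed     */
    return (uint8_t)(x >> 32);                 /* byte lands in bits 32..39        */
}

/* Lookup-table version: one dependent load (a few cycles on an L1 hit, much more on a miss). */
static uint8_t table[256];
static uint8_t rev_u8_d(uint8_t b) { return table[b]; }

int main(void) {
    for (int i = 0; i < 256; i++) table[i] = rev_u8_a((uint8_t)i);
    printf("%#x %#x\n", rev_u8_a(0x01), rev_u8_d(0x01));  /* both print 0x80 */
    return 0;
}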
There are tools like LLVM-MCA or uiCA (which is apparently better, according to @PeterCordes) to analyse relatively large pieces of code and understand how the processor can execute the instructions in practice; for example, something like llvm-mca -mcpu=skylake your_code.s runs the analysis on an assembly file. None of them is even close to being exact in practice, because there are many things to consider at runtime. For example, a piece of code can be fast when the cache is hot and slow when it is cold. The size of the code matters, as does the memory access pattern. Some of the algorithms implemented in silicon are not publicly documented (e.g. the prefetchers). Still, tools like LLVM-MCA can be good for understanding what approximately happens in simple code (without considering loops, branch mispredictions, cache misses, and so on).
In practice, the first code can be faster when the cache is cold, since fetching data from DRAM usually takes several dozen nanoseconds (while the time to execute the first code is likely far less than that). When the cache is hot, that is, when the table is in the L1 cache, the second one is typically faster (fewer uops and a lower latency). Thus, if the LUT is small and you use this code in a loop, the second version is likely to be faster. If the code is executed only once, the former may be the better solution. Things are a bit more complex if multiple threads are running on the same core.
The performance is also tightly related to the target processor: some processors have 3 load ports while others may have only 1, the latency of the L1 cache can be 3 cycles on one processor and 4 on another, and instruction latencies and reciprocal throughputs can be very different between two architectures (e.g. Apple M1 vs ARM Cortex-A53, or Intel Skylake vs AMD Excavator). Honestly, the best thing to do is to write several benchmarks matching your real-world use cases. Be very careful while writing them, since a bad benchmark can lead to drastically different conclusions. To start with, you can read: Idiomatic way of performance evaluation?.
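As a starting point for such a benchmark, a hot-cache throughput test could look like the sketch below (reusing the C versions from above; the names and the iteration count are mine, and it deliberately measures only the warm-cache, predictable-input case, so it says nothing about cold caches or latency-bound uses):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint8_t rev_u8_a(uint8_t b) {            /* arithmetic version */
    return (uint8_t)((((uint64_t)b * 0x80200802ULL) & 0x0884422110ULL)
                     * 0x0101010101ULL >> 32);
}
static uint8_t table[256];
static uint8_t rev_u8_d(uint8_t b) { return table[b]; }  /* LUT version */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    for (int i = 0; i < 256; i++) table[i] = rev_u8_a((uint8_t)i);

    const long n = 100000000;      /* enough iterations to dwarf timer overhead */
    volatile uint8_t sink = 0;     /* keeps the loops from being optimized away */

    double t0 = now_sec();
    for (long i = 0; i < n; i++) sink ^= rev_u8_a((uint8_t)i);
    double t1 = now_sec();
    for (long i = 0; i < n; i++) sink ^= rev_u8_d((uint8_t)i);
    double t2 = now_sec();

    printf("arithmetic: %.2f ns/call\n", (t1 - t0) * 1e9 / n);
    printf("lookup:     %.2f ns/call\n", (t2 - t1) * 1e9 / n);
    return 0;
}

Compile it with optimizations (e.g. gcc -O2) and run it several times; as the linked question explains, numbers from a loop like this are only a rough indication, and a benchmark that feeds the functions the same data pattern as your real program is worth much more.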
Upvotes: 2