user15563851

Assembly: why is lea fast?

I had a conversation with my professor and he said:

leaq (%rax,%rax,8), %rax

Is faster than:

imulq $9, %rax

I asked him why (in both cases we are doing multiplication by nearly the same number) and he said we wouldn't get into that.

Can someone help me understand, in a simple way, why leaq is fast in general?

A question that arose from the comments: is

imulq $9, %rax

faster than doing two instructions, one shifting left and the other adding the original %rax (which we could have saved in a register beforehand), and why?

Upvotes: 2

Views: 2039

Answers (2)

Peter Cordes

Reputation: 365981

LEA is just a shift-and-add instruction, with a 2-bit shift count as part of the machine encoding. That's much cheaper to build in hardware than a full 64-bit multiplier, and why CPUs can easily have multiple execution units that can handle LEA uops. (For better than 1/clock throughput).

Note that LEA latency is 1 cycle only for simple-enough addressing modes (before Ice Lake). On Intel SnB-family CPUs, there aren't any uops with 2-cycle latency, and LEA with 3 components (two + operations) has 3 cycle latency. Apparently Intel couldn't or didn't fit enough gate-delays for 2 additions (or a 3->2 reduction and one addition) into a single ALU cycle until Ice Lake.

But yes, simpler LEAs like the one in the question (with no displacement) are 1 cycle latency and throughput of 2/clock on SnB-family, with "slow" LEAs only running on port 1 (the only execution port on SnB-family that can run integer uops with latency other than 1.)

Ice Lake is always 1c latency, 1 uop. 2/clock throughput for addressing modes including a scaled-index (shift-count != 0), or 4/clock otherwise. (Even for 3-component operations like lea 1(%rax, %rcx), %edx that would be a "slow LEA" on Skylake or Zen).

On AMD, lea is 1 or 2-cycle latency, with similar throughput reduction (fewer ports) for slow LEA. And the conditions for being fast are more restrictive: a scale factor other than 1 makes it slow. But Zen still has 2 execution units that can handle "slow" LEAs, 4 for fast LEAs. https://uops.info/ https://agner.org/optimize/


Is imulq $9, %rax faster than doing two instructions, one to shift left and the other to add %rax (which we could previously save in a register)?

imul $9, %rax is 1 uop, 3c latency, 1/clock throughput on AMD since Zen, Intel since Nehalem. (https://uops.info/). Higher latency on older CPUs, especially for 64-bit operand-size.

shl $3, %rax / add %rcx, %rax is 2 uops for the front-end, but only has 2 cycle latency. (And probably an extra mov somewhere before that, for a 3rd uop).

However, any decent compiler would use lea (%rax, %rax, 8), %rax instead (a*9 = a + a*8) : 1 uop, 1c latency on Intel, 2/clock throughput so it's not worse in any way, and better in many ways. (Or at worst, 2 cycle latency on AMD because of the scaled index, but that's still better than imul.)

When you're looking at a single instruction or short sequence, performance isn't one dimensional, but rather 3: front-end uops, back-end ports, and latency for the critical path. There's no single-number cost you can add up across instructions to find out how long a block of instructions will take; the whole point of superscalar out-of-order execution is to find instruction-level parallelism, whose existence depends on how instructions use each other's results. (But sometimes you can say that one sequence is at least as good as another in every way, if it's the same or better in all 3 ways across all existing CPUs.)

Upvotes: 1

mkayaalp

Reputation: 2736

The lea (load effective address) instruction is a way to perform the common operation of pointer arithmetic. How an instruction refers to its operands is called its addressing mode, and lea supports base plus scaled index plus displacement addressing modes (among others).

address = base address + index * scaling + offset

where the scaling value can be one of a few powers of two (1, 2, 4, or 8). These scales are useful for indexing arrays of bytes, 2-byte characters, 4-byte integers, 8-byte pointers, etc. It is not capable of encoding or performing multiplication by arbitrary values. In hardware, these few options can be implemented with a couple of multiplexers, adding only a fraction of a cycle of delay.

A multiplication instruction, on the other hand, goes through multiplier circuitry that can multiply two arbitrary full-width (64-bit) operands. This is an operation of significantly higher complexity. Even with many full-width adders working in parallel in a reduction tree, it has about six times (log₂ 64 = 6) the delay of a full-width addition (although the design might incorporate an optimization that lets it finish simpler multiplications sooner).

Upvotes: 4
