Reputation: 11558
In my Intel x86 Pentium handbook it says that ADD and shifts like SAL/SHR take 1/3 clock, compared to instructions like JMP and MOV that take 1 clock. Is it really true that a bunch of ADDs and shifts will be 3 times faster than a bunch of MOVs?
I guess I am doubly confused because there is a table of "latencies" on the web for the "Pentium M", and none of the timings are 1/3, although a few are 1/2. Is this because my book is old, and on newer Pentiums a shift is the same speed as JMP?
Upvotes: 3
Views: 857
Reputation: 490278
Assuming this is about the original Pentium (i.e., not a Pentium Pro or newer), the 1/3 does not mean "one third" (or anything like it). It means the instruction has a 1-cycle throughput and a 3-cycle latency: you can start one instruction every cycle, and one can finish every cycle, but each instruction passes through three pipeline stages, so there is a three-cycle delay between starting and finishing any particular instruction.
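To make the distinction concrete, here is a minimal idealized sketch of what "1-cycle throughput, 3-cycle latency" implies. The function cycles_for_run and its parameters are hypothetical (not from any Intel documentation); it assumes one fully pipelined unit and ignores pairing, caches, and stalls. It compares a run of independent instructions with a chain where each instruction needs the previous result:

```python
def cycles_for_run(n, latency, recip_throughput, dependent):
    """Idealized cycle count for n instructions on one pipelined unit.

    latency: cycles from starting an instruction until its result is usable.
    recip_throughput: minimum cycles between starts of successive instructions.
    dependent: True if every instruction needs the previous one's result.
    """
    if n == 0:
        return 0
    # Independent instructions can start every recip_throughput cycles.
    # A dependent one must also wait for its predecessor's result (latency).
    step = max(latency, recip_throughput) if dependent else recip_throughput
    # The last instruction starts (n - 1) steps in and then needs `latency` cycles.
    return (n - 1) * step + latency


print(cycles_for_run(100, latency=3, recip_throughput=1, dependent=False))  # 102
print(cycles_for_run(100, latency=3, recip_throughput=1, dependent=True))   # 300
```

Under this reading, independent instructions overlap almost perfectly, while a dependent chain pays the full latency for every link.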
The original Pentium had only two integer pipelines and no out-of-order execution. In a given clock cycle, the next instruction would execute in the U pipeline. If the right conditions were met, the instruction after it could execute in the V pipeline in the same cycle. Under no circumstances did more than two instructions execute in any given cycle, and under no circumstances did more than one instruction execute per clock in a single pipeline.
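A toy model of the U/V dual issue described above may help. This is a deliberate simplification, not the real P5 pairing rules (which have more conditions): the Instr class, its hypothetical pairable flag, and issue_cycles are illustrative names. It counts issue cycles only and ignores latencies and stalls:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    text: str
    dest: Optional[str]    # register written, if any
    srcs: Tuple[str, ...]  # registers read
    pairable: bool         # hypothetical "simple enough to pair" flag


def issue_cycles(program):
    """Toy in-order dual-issue model of the U/V pipelines.

    Each cycle the next instruction goes down the U pipe. The one after it
    joins it in the V pipe only if both are marked pairable and the second
    neither reads nor writes the register the first one writes.
    """
    cycles = 0
    i = 0
    while i < len(program):
        u = program[i]
        i += 1
        cycles += 1
        if i < len(program):
            v = program[i]
            conflict = u.dest is not None and (u.dest in v.srcs or u.dest == v.dest)
            if u.pairable and v.pairable and not conflict:
                i += 1  # v issues in the V pipe during the same cycle
    return cycles


independent = [
    Instr("add eax, 1", "eax", ("eax",), True),
    Instr("add ebx, 1", "ebx", ("ebx",), True),
] * 2
chain = [Instr("add eax, 1", "eax", ("eax",), True)] * 4

print(issue_cycles(independent))  # 2: two U/V pairs
print(issue_cycles(chain))        # 4: each add needs the previous result
```

Even in the best case the model never issues more than two instructions per cycle, which matches the limit stated above.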
Later processors (starting with the Pentium Pro) added out-of-order instruction scheduling and the ability to execute more than two instructions in a single cycle (they could have considerably more instructions "in flight", but were limited to retiring three per cycle). The Pentium 4 added the ability to execute two extremely simple instructions (register-to-register AND, OR, NOT, ADD, SUB, single-bit shift) in the same execution unit in a single clock cycle: it had an execution unit that actually ran at double the rated clock speed, so, for example, on a 2.8 GHz processor a small amount of the circuitry was running at 5.6 GHz.
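The 5.6 GHz figure is just the rated clock doubled. A small sketch of the arithmetic, assuming (as described above) that the fast unit really completes two simple instructions per core clock and that nothing else limits the run; run_time_ns and its parameters are hypothetical names:

```python
import math


def run_time_ns(n_ops, core_ghz, ops_per_core_cycle):
    """Time for n_ops simple ALU instructions fed through one execution unit
    that completes ops_per_core_cycle of them per core clock cycle."""
    core_cycles = math.ceil(n_ops / ops_per_core_cycle)
    return core_cycles / core_ghz  # cycles divided by cycles-per-ns gives ns


core_ghz = 2.8
print("fast unit clock:", core_ghz * 2, "GHz")
single = run_time_ns(1000, core_ghz, 1)  # unit running at the rated clock
double = run_time_ns(1000, core_ghz, 2)  # double-pumped unit
print(f"{single:.1f} ns vs {double:.1f} ns")
```

Under these assumptions the double-pumped unit halves the time for a run of such instructions; real code would rarely be limited by this unit alone.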
Upvotes: 3
Reputation: 471339
Don't confuse "latency" with "reciprocal throughput".
That 1/3 that you are seeing isn't the latency; it's the reciprocal throughput. The processor can sustain 3 ADDs per cycle (if they are all independent), but each one still takes at least 1 cycle to execute.
If you have latency 1 and reciprocal throughput of 1/3, that means that the processor can execute up to 3 ADDs simultaneously. But each one still takes 1 cycle.
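A rough way to see "latency 1, reciprocal throughput 1/3" in action is a tiny cycle-by-cycle scheduler. This is an idealized sketch, not a model of any specific Intel core: simulate is a hypothetical function that assumes 3 identical ALUs, a 1-cycle latency, unlimited instruction fetch, and perfect register renaming (only true read-after-write dependencies are kept):

```python
def simulate(instrs, num_alus=3, latency=1):
    """Idealized out-of-order schedule for a list of (dest, srcs) instructions.

    Each cycle, up to num_alus ready instructions start, oldest first.
    A result is usable `latency` cycles after its instruction starts.
    Returns the cycle by which every result is available.
    """
    # Record only true (read-after-write) dependencies: each source register
    # depends on its most recent earlier writer, as if registers were renamed.
    last_writer = {}
    deps = []
    for idx, (dest, srcs) in enumerate(instrs):
        deps.append([last_writer[r] for r in srcs if r in last_writer])
        last_writer[dest] = idx

    done_at = [None] * len(instrs)  # cycle when each result becomes available
    cycle = 0
    while None in done_at:
        started = 0
        for i, d in enumerate(deps):
            if started == num_alus:
                break
            inputs_ready = all(done_at[j] is not None and done_at[j] <= cycle for j in d)
            if done_at[i] is None and inputs_ready:
                done_at[i] = cycle + latency
                started += 1
        cycle += 1
    return max(done_at, default=0)


independent = [(f"r{k}", (f"r{k}",)) for k in range(9)]  # 9 ADDs on different registers
chain = [("eax", ("eax",))] * 9                          # 9 ADDs, each needing the last

print(simulate(independent))  # 3: three ALUs start three ADDs per cycle
print(simulate(chain))        # 9: each ADD waits 1 cycle for its input
```

Nine independent ADDs finish in 3 cycles (1/3 cycle each on average), while nine dependent ADDs still take 9 cycles, because each one individually takes 1 cycle.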
Historically, most Intel processors (since the Pentium?) have had 3 main execution units that can all do basic operations such as additions and shifts. That's why most of these instructions have a reciprocal throughput of 1/3.
Register-to-register MOVs should also be 1/3. But MOVs that touch memory (i.e., loads and stores) have historically been limited to 1 per cycle. (Sandy Bridge and later have increased this.)
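When a block mixes ALU work and memory accesses, a common back-of-the-envelope estimate is that the most heavily used resource sets the pace. A minimal sketch of that estimate; cycles_lower_bound and the capacities in older_core are assumptions that mirror the description above (3 ALUs, 1 memory access per cycle), not measured figures:

```python
from fractions import Fraction


def cycles_lower_bound(counts, per_cycle):
    """Throughput-only lower bound for a block of independent instructions.

    counts[k]: how many instructions need resource class k.
    per_cycle[k]: how many of them that class can start per cycle.
    The busiest class sets the pace; latency and dependencies are ignored.
    """
    return max(Fraction(counts[k], per_cycle[k]) for k in counts)


# Assumed capacities matching the answer's description of older cores:
# 3 ALUs (ADD, shifts, reg-to-reg MOV) but only 1 memory access per cycle.
older_core = {"alu": 3, "mem": 1}

print(cycles_lower_bound({"alu": 4, "mem": 1}, older_core))  # 4/3: ALU-bound
print(cycles_lower_bound({"alu": 3, "mem": 4}, older_core))  # 4: load/store-bound
```

So a run of register-to-register MOVs can go three per cycle, while a run of loads or stores on such a core would be held to one per cycle regardless of how many ALUs are idle.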
Upvotes: 10