Steven

Reputation: 871

Why do longer pipelines make a single delay slot insufficient?

I read the following statement in Patterson & Hennessy's Computer Organization and Design textbook:

As processors go to both longer pipelines and issuing multiple instructions per clock cycle, the branch delay becomes longer, and a single delay slot is insufficient.

I can understand why "issuing multiple instructions per clock cycle" can make a single delay slot insufficient, but I don't know why "longer pipelines" cause it.

Also, I do not understand why longer pipelines cause the branch delay to become longer. Even with a longer pipeline (more steps to finish one instruction), there's no guarantee that the cycle time will increase, so why would the branch delay increase?

Upvotes: 3

Views: 388

Answers (1)

Peter Cordes

Reputation: 364811

If you add any stages before the stage that detects branches (and evaluates taken/not-taken for conditional branches), 1 delay slot no longer hides the "latency" between the branch entering the first stage of the pipeline and the correct next program-counter address becoming known.

The first fetch stage needs info from later in the pipeline to know what to fetch next, because it doesn't itself detect branches. For example, superscalar CPUs with branch prediction need to predict which block of instructions to fetch next, separately from (and earlier than) predicting which way a branch goes once it has been decoded.

1 delay slot is only sufficient in MIPS I because branch conditions are evaluated in the first half of a clock cycle in EX, in time to forward to the 2nd half of IF, which doesn't need a fetch address until then. (Original MIPS is a classic 5-stage RISC: IF ID EX MEM WB.) See Wikipedia's article on the classic RISC pipeline for many more details, specifically the control hazards section.


That's why MIPS is limited to simple conditions like beq (find any mismatches from an XOR), or bltz (sign bit check). It cannot do anything that requires an adder for carry propagation (so a general blt between two registers is only a pseudo-instruction).
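As a rough illustration of why those conditions are cheap, here is a minimal Python sketch that models them on 32-bit register values. The function names are made up for this example; the `slt` + `bne` expansion of `blt` is what assemblers typically emit for the pseudo-instruction.

```python
MASK32 = 0xFFFFFFFF  # model registers as 32-bit values


def beq_taken(rs: int, rt: int) -> bool:
    # Equal iff the XOR has no set bits: a bitwise op plus an
    # "any bit set?" check, with no carry propagation.
    return ((rs ^ rt) & MASK32) == 0


def bltz_taken(rs: int) -> bool:
    # Negative iff bit 31 (the two's-complement sign bit) is set.
    return ((rs & MASK32) >> 31) == 1


def to_signed(x: int) -> int:
    # Reinterpret a 32-bit pattern as a signed integer.
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x


def blt_taken(rs: int, rt: int) -> bool:
    # Pseudo-instruction: slt $at, rs, rt ; bne $at, $zero, target
    # The signed compare (slt) is the part that needs a full-width
    # subtract, so it runs as a separate ALU instruction first.
    at = 1 if to_signed(rs) < to_signed(rt) else 0
    return at != 0
```

Only `blt_taken` involves an arithmetic comparison across the full register width; the other two are a bitwise operation and a single-bit test.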

This timing is very restrictive: a longer front-end could absorb the latency from a larger/more associative L1 instruction cache that takes more than half a cycle to respond on a hit. (MIPS I decode is very simple, though, with the instruction format intentionally designed so machine-code bits can be wired directly as internal control signals. So you can maybe make decode the "half cycle" stage, with fetch getting 1 full cycle, but even 1 cycle is still short with shorter cycle times at higher clock speeds.)

Raising the clock speed might require adding another fetch stage. Decode does have to detect data hazards and set up bypass forwarding; original MIPS kept that simpler by not detecting load-use hazards. Instead, software had to respect a load-delay slot until MIPS II. A superscalar CPU has many more possible hazards, even with 1-cycle ALU latency, so detecting what has to forward to what requires more complex logic for matching destination registers in older instructions against sources in younger instructions.

A superscalar pipeline might even want some buffering in instruction fetch to avoid bubbles. A multi-ported register file might be slightly slower to read, maybe requiring an extra decode pipeline stage, although probably that can still be done in 1 cycle.

So, as well as superscalar execution making 1 branch delay slot insufficient by its very nature, a longer pipeline also increases branch latency if the extra stages are between fetch and branch resolution. For example, an extra fetch stage and a 2-wide pipeline could have 4 instructions in flight after a branch instead of 1.
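The arithmetic behind that "4 instead of 1" can be sketched as follows. This is a deliberately simplified model with hypothetical function names: it assumes every cycle of branch delay lets a full fetch group enter the pipeline, and it ignores where the branch sits inside its own fetch group.

```python
def branch_delay_cycles(base_delay: int, extra_front_end_stages: int) -> int:
    # Each stage added between fetch and branch resolution adds one
    # cycle before the correct next fetch address is known.
    return base_delay + extra_front_end_stages


def instructions_after_branch(delay_cycles: int, width: int) -> int:
    # In this model, `width` instructions are fetched per cycle of delay.
    return delay_cycles * width


# Scalar MIPS I-style pipeline: 1 cycle of delay, 1 instruction wide.
classic = instructions_after_branch(branch_delay_cycles(1, 0), width=1)

# One extra fetch stage, 2-wide.
longer_and_wider = instructions_after_branch(branch_delay_cycles(1, 1), width=2)

print(classic, longer_and_wider)
```

Under those assumptions the two cases come out to 1 and 4, so a fixed architectural count of delay slots cannot match both designs.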


But instead of introducing more branch delay slots to hide this branch delay, the actual solution is branch prediction. (However, some DSPs and high-performance microcontrollers do have 2 or even 3 branch delay slots.)

Branch-delay slots complicate exception handling; you need a fault-return and a next-after-that address, in case the fault was in a delay slot of a taken branch.
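To make the two-address point concrete, here is a small Python sketch of the state needed to resume after a fault. It is a toy model, not how any particular ISA encodes it: `ResumeState` and `resume_state` are hypothetical names, and fixed 4-byte instructions are assumed.

```python
from typing import NamedTuple

INSN_BYTES = 4  # assumption: fixed-width 4-byte instructions


class ResumeState(NamedTuple):
    pc: int       # instruction to re-run (the one that faulted)
    next_pc: int  # instruction to run after it


def resume_state(fault_pc: int, in_taken_branch_delay_slot: bool,
                 branch_target: int = 0) -> ResumeState:
    if in_taken_branch_delay_slot:
        # The instruction after the delay slot is the branch target,
        # not fault_pc + 4, so one address cannot describe the
        # resume point.
        return ResumeState(fault_pc, branch_target)
    # Ordinary case: execution continues sequentially.
    return ResumeState(fault_pc, fault_pc + INSN_BYTES)
```

Without delay slots, `next_pc` is always derivable from `pc`, so a single return address would be enough.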

Upvotes: 5
