AAA

Reputation: 191

Data Hazards and stalls

I am studying for my exam tomorrow and I am having difficulty with the code below:

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

There is an ALU-ALU dependency here on register $2. The sub instruction does not write its result until the fifth stage, meaning that we would have to waste three clock cycles in the pipeline. My question is: why 3 clock cycles? This dependency can be solved by inserting two nops, so aren't we wasting only 2 clock cycles? Please clarify this for me; I am trying to relate the nops to the wasted cycles, and I am sure that I have a huge misunderstanding here.

Upvotes: 4

Views: 1309

Answers (2)

Erik Eidt

Reputation: 26646

The names of the pipeline stages are somewhat less than standard.  More common is to use IF (instruction fetch from Instruction Memory IM), ID (instruction decode and register read), EX (execute/ALU), MEM (Data Memory read or write), and WB (write back register result).

Whether it is 2 vs. 3 clock cycles depends on your internal architecture.

When designed for the 3-clock-cycle approach, the processor has to stall for 3 cycles so that the reg write (WB) of the first instruction (cycle 5) fully completes before the reg read (ID) of the second instruction can see the up-to-date values.  Without other mitigation, the reg read stage of the second instruction is not guaranteed to get proper values until clock cycle 6.  Without the dependency/hazard, the second instruction would have done its reg read (ID) in cycle 3; it is delayed until cycle 6, so the delay is 6-3=3 cycles.

When designed for 2 clock cycles, the reg write (WB) stage of the first instruction (its stage 5, in clock cycle 5) overlaps with the reg read (ID) stage of the second instruction, which is also in clock cycle 5.
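The 2-versus-3 arithmetic can be sketched in a few lines of Python. This is only a rough model under stated assumptions: a classic 5-stage pipeline, the producer fetched in cycle 1, no ALU forwarding, and each consumer considered in isolation (in a real pipeline, stalling the first consumer also delays the instructions behind it). The names `stall_cycles` and `same_cycle_write_read` are made up for illustration.

```python
# Stage positions in a classic 5-stage pipeline: IF=1, ID=2, EX=3, MEM=4, WB=5.
ID_STAGE = 2
WB_STAGE = 5


def stall_cycles(distance, same_cycle_write_read):
    """Stalls needed by a consumer `distance` instructions after the producer.

    Assumes the producer is fetched in cycle 1 and there is no ALU forwarding.
    `same_cycle_write_read` is True when a register written in WB can be read
    by ID in the same clock cycle (the 2-cycle design).
    """
    producer_wb_cycle = WB_STAGE            # producer writes back in cycle 5
    natural_id_cycle = distance + ID_STAGE  # consumer's ID cycle with no stall
    if same_cycle_write_read:
        earliest_id_cycle = producer_wb_cycle      # ID may share cycle 5 with WB
    else:
        earliest_id_cycle = producer_wb_cycle + 1  # ID must wait until cycle 6
    return max(0, earliest_id_cycle - natural_id_cycle)


# distance, stalls in the 2-cycle design, stalls in the 3-cycle design
for d in (1, 2, 3, 4):
    print(d, stall_cycles(d, True), stall_cycles(d, False))
```

For the instruction right after the producer (distance 1) this gives 2 stalls when WB and ID can share cycle 5, and 3 stalls when they cannot.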

The 2-cycle design works in one of the following two ways:

  1. Half cycles

    • The WB stage is very quick, and, its data is ready at the absolute beginning of the clock cycle (nothing has to be computed at all) — it effectively is completed in the first half of cycle 5 (this example).
    • The ID stage is slow and gets the data in the second half of cycle 5 — thus, it obtains data that is up-to-date as of reg writes that happen in this cycle.
  2. Register forwarding

    • The WB stage's data is available at the absolute beginning of the clock cycle, though WB takes the full cycle.
    • The ID stage accesses the register file, but the register file knows about the write happening in this cycle and has an internal forward to ensure that reads read either register values or new values that are being written.  The register file effectively has an internal bypass allowing written values to be read in the same cycle.
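Both options look the same from outside the register file: a read in the same cycle as a write sees the new value. A minimal behavioral sketch of that, in Python, might look like the following. The `RegisterFile` class and its `write_first` flag are hypothetical names for illustration; this models behavior only, not the actual circuit or timing.

```python
class RegisterFile:
    """Toy 32-entry register file with one write port and any number of reads."""

    def __init__(self, write_first):
        self.regs = [0] * 32
        # True models half-cycle write or an internal bypass: reads in the
        # same cycle observe the value being written.
        self.write_first = write_first

    def _write(self, reg, value):
        if reg != 0:  # $0 is hardwired to zero in MIPS
            self.regs[reg] = value

    def clock(self, write_reg, write_value, read_regs):
        """One clock cycle: WB writes `write_reg` while ID reads `read_regs`."""
        if self.write_first:
            self._write(write_reg, write_value)
            return [self.regs[r] for r in read_regs]
        # Without the bypass, ID sees the old (stale) contents this cycle.
        stale = [self.regs[r] for r in read_regs]
        self._write(write_reg, write_value)
        return stale


# sub writes $2 in WB while a later instruction reads $2 and $5 in ID.
print(RegisterFile(write_first=True).clock(2, 42, [2, 5]))   # [42, 0]
print(RegisterFile(write_first=False).clock(2, 42, [2, 5]))  # [0, 0]
```

With `write_first=False` the reader gets the stale $2, which is why that design needs the extra (third) stall cycle.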

We speak of these stalls as conditional — if we didn't then every instruction would stall for 2-3 cycles just in case the next subsequent instruction used the ALU result.  This would almost (but perhaps not quite) negate the benefit of pipelining.

Note that, I would assert, the conditional logic to detect when we need a stall due to an ALU/ALU dependency is about as complicated as the logic to do an ALU/ALU bypass (because they are the same test).  Since the bypass totally mitigates the performance issue, a designer would always prefer the bypass.

The idea of the bypass is that the value needed in the second instruction's EX/ALU stage is actually available somewhere in the CPU — as it has been computed already in the prior clock cycle (that value is just not in the right place).  The problem is that the reg read (ID) for the second instruction obtained stale values: refreshing them (ignoring those stale values and taking them directly from the ALU output) when appropriate is the bypass solution.
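To make the "same test" point concrete, here is a small Python sketch that applies the dependency check to the code in the question and reports where each consumer of $2 could get its value. It assumes the usual textbook 5-stage arrangement, where the ALU result sits in the EX/MEM pipeline register one cycle later and in MEM/WB two cycles later, plus a register file that allows same-cycle write and read. Those pipeline register names and the `forward_source` helper are assumptions for illustration, not something from the question.

```python
def forward_source(distance, dest, srcs):
    """Where a consumer `distance` instructions after the producer finds `dest`.

    The condition `dest in srcs` is the hazard test; a stall-based design would
    use the very same comparison to decide when to stall.
    """
    if dest == 0 or dest not in srcs:
        return "no dependency"
    if distance == 1:
        return "EX/MEM bypass"
    if distance == 2:
        return "MEM/WB bypass"
    if distance == 3:
        return "register file (same-cycle write/read)"
    return "register file"


# (opcode, destination register or None, source registers)
program = [
    ("sub", 2, (1, 3)),
    ("and", 12, (2, 5)),
    ("or", 13, (6, 2)),
    ("add", 14, (2, 2)),
    ("sw", None, (15, 2)),  # sw reads $15 and $2, writes no register
]

producer_dest = program[0][1]  # $2, written by sub
for distance, (op, _dest, srcs) in enumerate(program[1:], start=1):
    print(op, "->", forward_source(distance, producer_dest, srcs))
```

Under these assumptions none of the four consumers needs a stall: and and or take the value from a bypass, add picks it up through the register file in the cycle sub writes it, and sw simply reads the register file afterwards.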

So, I find it strange to talk about stalls for an ALU/ALU dependency/hazard when mitigating solutions (e.g. bypass) exist.  I don't think even the original MIPS required software NOP mitigation for an ALU/ALU hazard.  (It did for a load followed by a use, which is a MEM/ALU hazard that requires both a bypass and a stall, the latter of which was not provided in the original MIPS, so software had to ensure the use was separated from the load by at least one instruction, possibly by inserting a NOP.)

Upvotes: 2

Michael Dorgan

Reputation: 12515

From https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec12.pdf

Basically, you are losing 2 cycles to the first instruction affected by the hazard (the and instruction) and 1 cycle to the following one (the or instruction). The stall here hits the whole pipeline, not just the following instruction.

See pages 8 - 10 of the PDF for a picture of this. Here is a quick ASCII of it:

     1   2   3   4   5  
sub IM  Reg ==> DM  Reg 
and     IM   X   X  Reg 
or          IM   X  Reg 

Here the Xs represent stalls. Note that both the and instruction and the or instruction are stalled at the 2nd stage of their pipeline, awaiting the result from stage 5 of the sub instruction.

Upvotes: 0
