Reputation: 191
I am studying for my exam tomorrow and I am having difficulty with the code below:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Due to the ALU-ALU dependency on register $2, the sub instruction does not write its result until the fifth stage, meaning that we would have to waste three clock cycles in the pipeline. My question is: why 3 clock cycles? This dependency can be solved by inserting two nops, so aren't we wasting only 2 clock cycles? Please clarify, as I am trying to relate the nops to the wasted cycles and I am sure I have a big misunderstanding here.
Upvotes: 4
Views: 1309
Reputation: 26646
The names of the pipeline stages are somewhat less than standard. More common is to use IF (instruction fetch from Instruction Memory IM), ID (instruction decode and register read), EX (execute/ALU), MEM (Data Memory read or write), and WB (write back register result).
Whether it is 2 or 3 clock cycles depends on the processor's internal design.
In a 3-cycle design, the processor has to stall for 3 cycles so that the register write (WB) of the first instruction (cycle 5) fully completes before the register read (ID) of the second instruction can see the updated value. Without other mitigation, the second instruction's register read is not guaranteed to get correct values until clock cycle 6. Without the dependency/hazard, the second instruction would have done its register read (ID) in cycle 3; it is delayed until cycle 6, so the delay is 6 - 3 = 3 cycles.
In a 2-cycle design, the register write (WB) stage of the first instruction, in clock cycle 5, overlaps with the register read (ID) stage of the second instruction, also in clock cycle 5.
This works in one of the following two ways:
Half cycles
Register forwarding
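To relate the nops to the wasted cycles, here is a minimal Python sketch that counts the bubbles for the code in the question under both designs. The model is an assumption made for illustration: a 5-stage in-order pipeline (IF, ID, EX, MEM, WB) with no ALU bypass, and the names `count_stalls` and `same_cycle_read` are hypothetical. `same_cycle_read=True` stands for the 2-cycle design, where ID can read a register in the same cycle that WB writes it.

```python
def count_stalls(program, same_cycle_read):
    """Count bubble cycles in a 5-stage pipeline without ALU bypassing.

    program: list of (dest, sources); dest is None if nothing is written.
    same_cycle_read: True if ID may read a register in the WB cycle itself.
    """
    ready = {}     # register -> first cycle in which ID may read it
    prev_id = 1    # so that the first instruction does ID in cycle 2
    stalls = 0
    for dest, sources in program:
        earliest = prev_id + 1  # ID cycle if there were no hazard
        needed = max([ready.get(r, 0) for r in sources] + [earliest])
        stalls += needed - earliest
        prev_id = needed
        if dest is not None:
            wb = needed + 3     # ID -> EX -> MEM -> WB
            ready[dest] = wb if same_cycle_read else wb + 1
    return stalls


program = [
    ("$2",  ["$1", "$3"]),    # sub $2, $1, $3
    ("$12", ["$2", "$5"]),    # and $12, $2, $5
    ("$13", ["$6", "$2"]),    # or  $13, $6, $2
    ("$14", ["$2", "$2"]),    # add $14, $2, $2
    (None,  ["$15", "$2"]),   # sw  $15, 100($2)
]

print(count_stalls(program, same_cycle_read=True))   # 2
print(count_stalls(program, same_cycle_read=False))  # 3
```

In this model only the and is held back. Once it has waited, the or, add and sw reach ID after $2 is already readable, so the total is the 2 or 3 cycles discussed above, which is also the number of nops software would have to insert.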
We speak of these stalls as conditional: if they were not, every instruction would stall for 2-3 cycles just in case the next instruction used the ALU result. This would almost (but perhaps not quite) negate the benefit of pipelining.
I assert that the conditional logic to detect when a stall is needed for an ALU/ALU dependency is about as complicated as the logic for an ALU/ALU bypass (because they are the same test). Since the bypass completely mitigates the performance issue, a designer would always prefer the bypass.
The idea of the bypass is that the value needed in the second instruction's EX/ALU stage is actually available somewhere in the CPU — as it has been computed already in the prior clock cycle (that value is just not in the right place). The problem is that the reg read (ID) for the second instruction obtained stale values: refreshing them (ignoring those stale values and taking them directly from the ALU output) when appropriate is the bypass solution.
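As a rough illustration of the bypass, the following Python sketch picks the operand for the EX stage. It is not taken from any real design: the function name `select_operand` and the representation of the EX/MEM and MEM/WB pipeline latches as `(dest_reg, value)` tuples are assumptions for this example. The comparison of register numbers is the same test that stall detection would need.

```python
def select_operand(src, regfile_value, ex_mem, mem_wb):
    """Return the value the ALU should use for source register `src`.

    regfile_value: the (possibly stale) value read during ID.
    ex_mem, mem_wb: (dest_reg, value) held in the pipeline latches,
                    or None if the latch holds no register result.
    """
    # Check the most recent result (EX/MEM) first, then the older one.
    for latch in (ex_mem, mem_wb):
        # $0 is hardwired to zero in MIPS, so it is never forwarded.
        if latch is not None and latch[0] == src and src != "$0":
            return latch[1]
    return regfile_value


# and $12, $2, $5 in EX while the sub result (here 7) sits in EX/MEM:
print(select_operand("$2", 0, ex_mem=("$2", 7), mem_wb=None))  # 7
# $5 is not produced by an in-flight instruction, so ID's value is used:
print(select_operand("$5", 3, ex_mem=("$2", 7), mem_wb=None))  # 3
```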
So I find it strange to talk about stalls for an ALU/ALU dependency/hazard when mitigating solutions (e.g. a bypass) exist. I don't think even the original MIPS required software NOP mitigation for an ALU/ALU hazard. (It did for a load followed by a use, which is a MEM/ALU hazard that requires both a bypass and a stall, the latter of which was not provided in the original MIPS, so software had to ensure the use was separated from the load by at least one instruction, possibly by inserting a NOP.)
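The software side of the load-use case can be sketched as a small pass that inserts a nop whenever the instruction right after a load reads the loaded register. This is only an illustration under assumptions: the `(op, dest, sources)` tuple format and the name `fill_load_delay_slots` are hypothetical, and a real assembler or compiler would first try to move a useful independent instruction into the slot.

```python
def fill_load_delay_slots(program):
    """Insert a nop after each lw whose result is used by the next instruction.

    program: list of (op, dest, sources) tuples.
    Returns a new list; the input is not modified.
    """
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        op, dest, _ = instr
        has_next = i + 1 < len(program)
        if op == "lw" and has_next and dest in program[i + 1][2]:
            out.append(("nop", None, []))
    return out


code = [
    ("lw",  "$2", ["$4"]),         # lw  $2, 0($4)
    ("add", "$5", ["$2", "$3"]),   # add $5, $2, $3  (uses $2 right away)
]
for instr in fill_load_delay_slots(code):
    print(instr[0])
```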
Upvotes: 2
Reputation: 12515
From https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec12.pdf
Basically, you are losing 2 cycles to the first instruction affected by the hazard (the and) and 1 cycle to the following one (the or). The stall here hits the whole pipeline, not just the following instruction.
See pages 8 - 10 of the PDF for a picture of this. Here is a quick ASCII of it:
     1    2    3    4    5
sub  IM   Reg  ==>  DM   Reg
and       IM   X    X    Reg
or             IM   X    Reg
Where the Xs represent stalls. Note that the and and the or are stalled at the 2nd stage of their pipeline, awaiting the result from stage 5 of the sub instruction.
Upvotes: 0