zbqv

Reputation: 79

Computer Architecture pipeline stalls

First of all, sorry for my poor English. This question is a problem from the textbook for my Computer Architecture course. I found the answer online, but I still cannot work out the details.

The following shows the stages of each instruction in a five-stage (fetch, decode, execute, memory, write) single-pipeline microarchitecture without a forwarding mechanism. All operations take one cycle, except that LW and SW take 1 + 2 cycles and branches take 1 + 1 cycles.

Loop:             C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 C13 C14 ...
LW   R3, 0(R0)    F   D   E   M   -   -   W
LW   R1, 0(R3)        F   D   -   -   -   E   M   -   -   W
ADDI R1, R1, #1           F   -   -   -   D   -   -   -   E   M   W
SUB  R4, R3, R2                           F   -   -   -   D   E   M   W
SW   R1, 0(R3)                                            F   D   W   M   ...
BNZ  R4, Loop                                                 F   D   E   ...
...

I have several questions:

  1. Why can the 2nd instruction start D in C2? As I have learned, the D stage includes "register read", but the previous instruction doesn't write back to R3 until C7.

  2. Similarly, why does the 3rd instruction's D start at C7, and its E at C11?

  3. Why must the 4th instruction start at C7 instead of C4?

This problem originates from the book "Computer Architecture: A Quantitative Approach 5e", example 3.11.

Upvotes: 2

Views: 1486

Answers (3)

Jaymin Jasoliya

Reputation: 11

Why can the 2nd instruction start D in C2? D includes reg-read, but the previous instruction doesn't write back to R3 until C7.

Ans: this is because the decode stage in the simple MIPS pipeline has two parts (sub-stages).

DEC = DECODE + RR (REGISTER READ)

The instruction can be decoded, that is, the opcode can be read and decoded, but because of the dependency in this case, RR stalls until the first load instruction has fetched R3 from memory. With simple forwarding in C7, the next load can then go to execution.

The DECODE stage is split into two sub-stages to avoid a structural hazard. If you look at the diagram in "Computer Architecture: A Quantitative Approach" again, you will see a dotted line and a solid line, drawn intentionally to show that the overall work is divided into two parts (DECODE OPCODE + REG READ).

For the other two questions, I agree with @Peter Cordes.

Hope this helps. Jaymin

Upvotes: 1

Peter Cordes

Reputation: 364059

The Classic RISC pipeline wiki article is very good. Check it out if you haven't.

  1. Why can the 2nd instruction start D in C2? D includes reg-read, but the previous instruction doesn't write back to R3 until C7.

I'm not sure; I haven't spent a lot of time on the classic-RISC pipeline. Based on what we see for this instruction and for ADDI, it looks like register-read happens in the E stage.

That perfectly explains E stalling until the previous load's write-back. If you're sure reg-read is supposed to happen in the D stage for the pipeline you're studying, then this solution doesn't match your pipeline; it's correct for a different pipeline that doesn't read registers until Execute.

  2. Why does the 3rd inst's D start at C7, and E start at C11?

The D stage of the pipeline is occupied by the previous instruction until C7, at which point the ADDI can decode.

R1 isn't ready until cycle 11, at which point the data can be forwarded from the memory stage of the previous instruction, so the ADDI's Execute can happen in parallel with Writeback in the previous instruction. This is called a "bypass".

A bypass can let ALU operations run with 1 cycle latency, so you could use the output of an ADD in the next instruction without a stall.
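
To put numbers on that claim, here is a minimal sketch for a textbook 5-stage pipeline with single-cycle stages and register read in D (so not the multi-cycle-memory pipeline in the question). The function name `stall_cycles` is made up for illustration, and the no-bypass case assumes the register file can be written and read within the same cycle.

```python
def stall_cycles(producer_is_load, forwarding):
    """Bubbles before a dependent instruction issued right after its producer.

    Assumed model: classic 5-stage pipeline, every stage takes one cycle.
    """
    F, D, E, M, W = 1, 2, 3, 4, 5            # cycles of the producer's stages
    consumer_d, consumer_e = D + 1, E + 1    # consumer is one cycle behind
    if forwarding:
        # With a bypass, the value is usable the cycle after it is produced:
        # after E for an ALU op, after M for a load.
        value_ready = (M if producer_is_load else E) + 1
        return max(0, value_ready - consumer_e)
    # Without a bypass, the consumer's register read in D has to wait for
    # the producer's W (assumption: write then read in the same cycle).
    return max(0, W - consumer_d)


print(stall_cycles(producer_is_load=False, forwarding=True))   # ALU, bypass
print(stall_cycles(producer_is_load=True, forwarding=True))    # load, bypass
print(stall_cycles(producer_is_load=False, forwarding=False))  # no bypass
```

In this model a dependent ALU pair needs no bubble with a bypass, a load followed by its consumer needs one, and without a bypass the consumer waits two cycles.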

  3. Why must the 4th inst start at C7 instead of C4?

Because the previous instruction is stalled in the fetch stage, and it's an in-order pipeline; no out-of-order execution.
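
The reading suggested in this answer can be checked with a small scheduling sketch. It is a hypothetical model, not the textbook's: it assumes registers are read when an instruction enters E, that a value written in W is usable by an E in the same cycle, and that a stage is freed only when its occupant moves on to the next stage. `schedule` and the tuple format of `program` are names invented for this example, and the extra branch cycle is not modeled.

```python
STAGES = ["F", "D", "E", "M", "W"]


def schedule(program, mem_latency=3):
    """Return, per instruction, the cycle in which it enters each stage.

    program: list of (opcode, dest_register_or_None, [source_registers]).
    Assumption: LW and SW occupy M for `mem_latency` cycles (1 + 2).
    """
    ready = {}    # register -> cycle of the W stage of its latest writer
    prev = None   # stage-entry cycles of the previous instruction
    result = []
    for op, dest, sources in program:
        entry = {}
        earliest = 1  # earliest cycle this instruction may enter a stage
        for i, stage in enumerate(STAGES):
            cycle = earliest
            if prev is not None:
                if i + 1 < len(STAGES):
                    # In-order: the stage is free once the previous
                    # instruction has entered the following stage.
                    cycle = max(cycle, prev[STAGES[i + 1]])
                else:
                    cycle = max(cycle, prev[stage] + 1)
            if stage == "E":
                # Assumed register read in E: wait for each producer's W.
                for reg in sources:
                    cycle = max(cycle, ready.get(reg, 0))
            entry[stage] = cycle
            slow = stage == "M" and op in ("LW", "SW")
            earliest = cycle + (mem_latency if slow else 1)
        if dest is not None:
            ready[dest] = entry["W"]
        result.append(entry)
        prev = entry
    return result


program = [
    ("LW",   "R3", ["R0"]),
    ("LW",   "R1", ["R3"]),
    ("ADDI", "R1", ["R1"]),
    ("SUB",  "R4", ["R3", "R2"]),
    ("SW",   None, ["R1", "R3"]),
    ("BNZ",  None, ["R4"]),
]

table = schedule(program)
for (op, _, _), entry in zip(program, table):
    print(op, entry)
```

Under those assumptions the model puts the second LW's E in C7, the ADDI's D in C7 and E in C11, and the SUB's F in C7, which agrees with the diagram in the question.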

Upvotes: 0

Isuru H

Reputation: 1221

It looks like your pipeline freezes the entire system when it does a memory-related operation (LW); other than that, I cannot think of a valid reason why ADDI cannot perform its decode in C4. I am not saying it's valid for a load operation to freeze the whole execution, but that seems to be the only logical explanation.

Instruction 2 can perform its decode in C3, but it has to wait until instruction 1 has written its data back to R3. That's why the execution of the second instruction is delayed until C7.

BTW, when you said you've found the answer on the "net", is it from a credible source?

Upvotes: 1
