Reputation: 181
I am studying the different pipeline stages of the MIPS R10000. The paper says the processor fetches 4 instructions per cycle from the instruction cache. But the latency of the instruction cache must be more than one cycle; I don't know its exact hit latency, but the hit latency of the L1 data cache in a Haswell processor is about 4 cycles.
So if we assume the L1 instruction cache latency is 3-4 cycles, how can the processor fetch 4 instructions every cycle?
Upvotes: 4
Views: 425
Reputation:
The MIPS R10000 had a single-cycle-latency instruction cache and could fetch any contiguous block of four instructions within a cache block, without alignment constraints.
Mechanically, this probably meant that it used four SRAM banks with at least partially independent addressing (the cache set address decode could be shared).
Since each bank is independently addressable, any contiguous sequence of four words within the sixteen-word cache block can be accessed: bank b holds words b, b+4, b+8, b+12, and a per-bank row address selects which of those words each bank delivers. Addressing rows [0, 0, 0, 0] gets words [0, 1, 2, 3] (words 0-3); rows [1, 0, 0, 0] gets words [4, 1, 2, 3] (words 1-4); rows [1, 1, 0, 0] gets words [4, 5, 2, 3] (words 2-5); ...; rows [3, 3, 3, 2] gets words [12, 13, 14, 11] (words 11-14); rows [3, 3, 3, 3] gets words [12, 13, 14, 15] (words 12-15).
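The banked addressing above can be sketched in a few lines. This is an illustrative model, not the actual R10000 circuitry: bank b holds every word whose index is congruent to b mod 4, each bank gets its own row address, and the bank outputs are rotated back into program order.

```python
# Sketch of banked unaligned fetch: four SRAM banks, bank b holding
# words b, b+4, b+8, b+12 of a 16-word cache block. A fetch of four
# contiguous words starting at ANY word index reads one row from each
# bank. Names and structure here are illustrative assumptions.

NUM_BANKS = 4

def banked_fetch(start_word, banks):
    """Fetch 4 contiguous words starting at start_word (unaligned OK)."""
    out = []
    for i in range(NUM_BANKS):
        word = start_word + i          # absolute word index wanted
        bank = word % NUM_BANKS        # which bank holds that word
        row = word // NUM_BANKS        # row address within that bank
        out.append(banks[bank][row])
    return out

# Stripe sixteen words across four banks:
# bank 0 = [0, 4, 8, 12], bank 1 = [1, 5, 9, 13], etc.
banks = [[b + NUM_BANKS * r for r in range(4)] for b in range(NUM_BANKS)]

print(banked_fetch(0, banks))   # words 0-3
print(banked_fetch(1, banks))   # words 1-4, no alignment penalty
print(banked_fetch(11, banks))  # words 11-14
```

Note that for `banked_fetch(11, banks)` the per-bank rows are exactly the [3, 3, 3, 2] case listed above; the function simply emits the words in program order instead of bank order.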
(The same banking could cross cache block boundaries, but then two cache block hits would have to be confirmed in parallel. Memoizing the way of the previous access would reduce this to one set check for the common case of sequential accesses within largish cache blocks; one set would use the memoized way and the other would perform the normal check when entering a new cache block. Page crossing is a similar issue.)
(A common alternative for multiple-instruction fetch does impose an alignment constraint: each fetch comes from a naturally aligned chunk of, e.g., 16 bytes.)
This processor did not redirect instruction fetch until a branch was detected in the second pipeline stage (decode), so a taken branch introduced a one-cycle bubble even with a correct prediction. An incorrect prediction might not be determined until some cycles later, because execution started in the fourth pipeline stage and instructions were executed out-of-order. (An incorrectly predicted taken branch could decode instructions already fetched in the taken-branch bubble, as these were stored in a "resume cache".)
Buffering of instructions can smooth out such hazards, since throughput rarely approaches the maximum anyway because of data dependencies and other hazards.
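A back-of-the-envelope model shows why the bubble is often hidden. The numbers here are hypothetical, chosen only to illustrate the point: if the stages downstream of fetch sustain less than the raw fetch bandwidth, a buffer between fetch and decode absorbs the occasional lost fetch cycle.

```python
# Illustrative steady-state model (hypothetical numbers): fetch
# delivers up to fetch_width instructions per cycle, each taken branch
# costs `bubble` lost fetch cycles, and downstream sustains decode_rate
# instructions per cycle. A buffer lets throughput settle to the
# slower of the two rates.

def steady_state_ipc(fetch_width, decode_rate, branch_every, bubble=1):
    """Average instructions/cycle with a taken branch every
    `branch_every` instructions."""
    fetch_cycles = branch_every / fetch_width + bubble
    fetch_ipc = branch_every / fetch_cycles
    # Overall throughput is limited by the slower side.
    return min(fetch_ipc, decode_rate)

# A taken branch every 8 instructions caps raw fetch at 8/3 ≈ 2.67 IPC,
# but if downstream sustains only 2 IPC, buffering hides the bubble.
print(steady_state_ipc(4, 2.0, 8))  # 2.0: bubble fully hidden
print(steady_state_ipc(4, 3.0, 8))  # ~2.67: fetch becomes the limit
```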
In general, a cache can provide multiple words per fetch (a natural alignment restriction facilitates a single bank providing the chunk) or be accessed multiple times per cycle (e.g., more deeply pipelining the instruction cache than other parts of the pipeline or using expensive multiported SRAM).
As long as a new address is provided every cycle, a fetch of multiple contiguous instructions can be done every cycle. If two addresses are available (predicted) per cycle, instructions after a taken branch could be fetched in the same cycle. (Another method of reducing the taken branch penalty — and providing other post-branch optimization opportunities — is to use a trace cache.)
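The trace-cache idea mentioned above can be sketched minimally: cache a *dynamic* sequence of instructions, including instructions past a taken branch, keyed by the start PC plus the predicted branch directions, so one access supplies the post-branch instructions too. Everything below (class name, keying scheme, example addresses) is an illustrative assumption, not a description of any shipped design.

```python
# Minimal trace-cache sketch: a trace is a recorded dynamic path,
# keyed by (start PC, branch directions along the path). One lookup
# can therefore span a taken branch. Purely illustrative.

class TraceCache:
    def __init__(self):
        self.traces = {}

    def fill(self, start_pc, branch_dirs, instrs):
        # Record the dynamic path that actually executed.
        self.traces[(start_pc, branch_dirs)] = instrs

    def fetch(self, start_pc, predicted_dirs):
        # Single lookup; returns None on a miss.
        return self.traces.get((start_pc, predicted_dirs))

tc = TraceCache()
# Dynamic path: two instructions, a taken branch, then the branch
# target's first instruction, stored as one trace.
tc.fill(0x100, ("taken",), ["i0", "i1", "br->0x200", "t0"])
print(tc.fetch(0x100, ("taken",)))      # hit: whole path in one access
print(tc.fetch(0x100, ("not-taken",)))  # miss: different predicted path
```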
Upvotes: 5