Load/Store unit and in pipeline

Question

I am learning about CPUs and pipelines. We know that the execute stage of a processor's pipeline contains some load/store units.

Now I want to know: in an out-of-order CPU, if a load instruction takes a long time because the requested data is not in cache, will the next load instruction be unable to execute until the first load has resolved, unless there is another load/store unit in the execute stage of the pipeline?

What is the corresponding behavior of an in-order CPU?

Peter Cordes · Accepted Answer

Will a cache-miss load block later loads? CPUs can be designed to avoid it.

The terms you're looking for are hit under miss for cache accesses while waiting for an outstanding cache miss, and miss under miss for having multiple cache misses outstanding (memory level parallelism).

See https://www.ece.ucsb.edu/~strukov/ece154BSpring2018/week3.pdf for Nonblocking cache and pipelined cache.

In an out-of-order CPU this can be handled by having the load instruction give up temporarily, and replay later from the scheduler. That's what Intel x86 CPUs do, at least. CPUs with a weakly-ordered memory model like ARM might only need to have a load buffer wait for that cache line to arrive at L1d cache without having to rerun the load instruction.

On an in-order pipeline, loads can be scoreboarded so nothing stalls until you try to actually read a register that was written by a cache-miss load. You need some load buffers to wait for incoming cache lines, but then hit-under-miss and miss-under-miss just work.
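To make the scoreboarding idea concrete, here is a toy timing model (my own sketch, not a description of any real core). It assumes one instruction issues per cycle, a made-up 100-cycle miss latency, and unlimited load buffers, and it ignores write-after-write hazards. The names `Instr` and `run_in_order` are invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    dst: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]   # registers read
    latency: int = 1        # cycles from issue until dst is ready


def run_in_order(program, scoreboarded):
    """Return the cycle at which the last result becomes available.

    scoreboarded=False: the pipeline stalls until each instruction completes.
    scoreboarded=True:  the next instruction may issue one cycle later, and
                        only stalls when it reads a register that is not
                        ready yet.
    """
    ready_at = {}   # register -> cycle at which its value is available
    cycle = 0       # earliest cycle the next instruction may issue
    finish = 0
    for ins in program:
        # In-order issue: wait for all source registers to be ready.
        start = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        done = start + ins.latency
        if ins.dst is not None:
            ready_at[ins.dst] = done
        finish = max(finish, done)
        cycle = start + 1 if scoreboarded else done
    return finish


MISS = 100  # hypothetical cache-miss latency in cycles

program = [
    Instr("r1", ("r0",), MISS),   # load r1 <- [r0]     (cache miss)
    Instr("r2", ("r0",), MISS),   # load r2 <- [r0+64]  (independent miss)
    Instr("r3", ("r1", "r2"), 1), # add  r3 = r1 + r2   (first real use)
]

print(run_in_order(program, scoreboarded=False))  # 201
print(run_in_order(program, scoreboarded=True))   # 102
```

In this model the second miss overlaps almost entirely with the first (miss-under-miss), and the pipeline only stalls at the `add`, which is the first instruction that actually reads a register written by a cache-miss load.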

(Paul Clayton posted an answer with slightly more detail on https://electronics.stackexchange.com/questions/98551/what-happens-on-a-cache-miss)

Remember that in-order only means instructions have to start executing in program order. High-latency instructions like division, or especially stores and loads, can complete out of order. Even cache hit stores have enough latency on a modern high-speed in-order design that you want to let other instructions run in the shadow of that latency.

Different complexity levels of pipelines might or might not support hit-under-miss and/or miss-under-miss, i.e. they provide different amounts of MLP (Memory Level Parallelism).
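As a rough illustration of what "different amounts of MLP" means, this toy model counts cycles for a run of independent cache-miss loads when the cache can track only a limited number of outstanding misses. The function name, the 100-cycle latency, and the one-load-per-cycle issue rate are assumptions made for this sketch, not figures for any real CPU.

```python
import heapq


def cycles_for_misses(n_misses, miss_latency, n_buffers):
    """Cycle at which the last of n independent cache-miss loads returns,
    given n_buffers slots for tracking outstanding misses."""
    in_flight = []   # min-heap of completion cycles of outstanding misses
    cycle = 0        # cycle at which the next load can issue
    last_done = 0
    for _ in range(n_misses):
        if len(in_flight) == n_buffers:
            # Every buffer is busy: stall until the oldest miss returns.
            cycle = max(cycle, heapq.heappop(in_flight))
        done = cycle + miss_latency
        heapq.heappush(in_flight, done)
        last_done = max(last_done, done)
        cycle += 1   # at most one load issues per cycle
    return last_done


for buffers in (1, 4, 8):
    print(buffers, cycles_for_misses(8, 100, buffers))
# 1 800
# 4 203
# 8 107
```

With a single buffer the misses are fully serialized, which is how a blocking cache behaves; with enough buffers the total time is close to a single miss latency.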

Out-of-order CPUs almost always support both; one of the major benefits of OoO exec is hiding memory latency, as well as ALU latency.

Cortex-A8's glossary mentions Hit Under Miss, but IDK where they actually use the term to specifically say that A8 supports it or not.

I did find 16.4.2. Memory system effects on instruction timings which says:

Because the processor is a statically scheduled design, any stall from the memory system can result in the minimum of a 8-cycle delay. This 8-cycle delay minimum is balanced with the minimum number of possible cycles to receive data from the L2 cache in the case of an L1 load miss. Table 16.16 gives the most common cases that can result in an instruction replay because of a memory system stall.

The list of conditions includes an L1 Load data miss. And a subsequent L2 miss can result in another replay, delaying later instructions for another 8 cycles.

This really sounds like A8 does not support anything under a cache miss, and instead basically stalls.
