precision

Reputation: 303

Is the number of fetched instructions per cycle constant for an out-of-order superscalar processor?

I would like to know whether the number of fetched instructions per cycle for an out-of-order superscalar processor (let's assume an Intel i7) is constant, or whether it may change based on the cache miss rate or the number of branch mispredictions of a given code/program.

If it is not constant, how can the reason behind it be explained? As far as I know, in modern multi-core processors the decoder unit always tries to resolve dependencies and to fill pipeline bubbles with independent instructions. So shouldn't the number of fetched instructions be (approximately) the same for any given workload?

Upvotes: 2

Views: 592

Answers (2)

Olof Forshell

Reputation: 3274

To illustrate this, take a look at this document, which shows the ability of different architectures to execute instructions of a certain type in "parallel." As you can see, combining instructions of one latency with instructions of another results in the CPU adapting to the mix in varying ways. Factor in that they may depend on the same registers, cache misses, branch mispredictions (or other, less obvious factors), and the interaction becomes even more complex.

Upvotes: 1

user2467198


The number of instructions fetched on a given cycle depends on multiple factors. For Intel's fourth-generation Core processors, when using the instruction cache rather than the µop cache, an aligned 16-byte block of instructions is fetched each cycle. From this chunk, up to six instructions can be parsed and placed in an instruction queue (which can hold up to 20 instructions from a thread). Up to five instructions from this queue can be decoded per cycle if two of them can be macro-fused into a single macro-op, the first instruction decodes into no more than four fused µops, and the remaining three instructions decode into single fused µops. The resulting µops are stored in a 56-entry µop Decode Queue (which also acts as a loop buffer). (Instructions that decode into more than four µops are handled by a special microcode engine.)
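As a rough illustration of these decode limits, here is a toy model in C (my own sketch, not anything from Intel's documentation; the slot-assignment rules are simplified) that counts how many cycles four decoders need for an instruction stream in which a macro-fused cmp/jcc pair occupies a single decode slot:

```c
#include <stdio.h>

/* Toy model: per cycle there are 4 decode slots; only slot 0 (the
 * "complex" decoder) handles instructions of 2-4 uops. Each entry is
 * the uop count of an instruction; 0 marks the jcc half of a
 * macro-fused pair, which rides along with the preceding cmp/test
 * and consumes no decode slot of its own. */
static int decode_cycles(const int *uops, int n) {
    int cycles = 0, i = 0;
    while (i < n) {
        int slots = 0;
        while (i < n && slots < 4) {
            if (uops[i] == 0) { i++; continue; }  /* fused jcc: free ride */
            if (slots > 0 && uops[i] > 1) break;  /* must wait for slot 0 */
            slots++; i++;
        }
        cycles++;
    }
    return cycles;
}

int main(void) {
    /* cmp+jcc (fused) followed by four single-uop instructions:
     * five instructions decode in the first cycle, the sixth
     * spills into a second cycle. */
    int stream[] = {1, 0, 1, 1, 1, 1};
    printf("cycles: %d\n", decode_cycles(stream, 6));  /* prints 2 */
    return 0;
}
```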

Since x86 has variable-length instructions (up to 15 bytes long), the number of instructions in a 16-byte chunk can vary. In addition, with taken branches, the branch target might not be aligned to a 16-byte boundary and the branch instruction might not end on the last byte of a chunk; this means that bytes at the beginning of a chunk before an unaligned taken-branch target are ignored, as are bytes in a chunk after a taken branch.
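To make the alignment effect concrete, here is a small sketch; the target address and branch end below are made-up values, chosen only to show the arithmetic of how few of a chunk's 16 bytes can end up being useful:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical addresses: enter the chunk at a taken-branch target,
     * leave it at the end of a taken branch. */
    unsigned target     = 0x100a;  /* unaligned branch target       */
    unsigned branch_end = 0x1010;  /* byte after the taken branch   */
    unsigned chunk_base = target & ~15u;      /* aligned 16B chunk  */
    unsigned chunk_end  = chunk_base + 16;
    unsigned end    = branch_end < chunk_end ? branch_end : chunk_end;
    unsigned useful = end - target;
    printf("chunk %#x..%#x: %u useful bytes of 16\n",
           chunk_base, chunk_end, useful);    /* 6 of 16 here */
    return 0;
}
```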

(In some other microarchitectures, a taken branch can result in a cycle in which no (useful) instructions are fetched. If the branch target buffer and instruction cache have two-cycle latency, then on a taken branch the target address is not yet available in the cycle after the branch begins to be fetched, so no instructions at the target can be fetched that cycle.)
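A back-of-the-envelope model shows how such a fetch bubble caps the effective fetch rate; the width and bubble size below are illustrative assumptions, not the parameters of any particular CPU:

```c
#include <stdio.h>

int main(void) {
    double width  = 4.0;  /* assumed fetch width (instructions/cycle) */
    double bubble = 1.0;  /* assumed dead cycles per taken branch     */
    /* Effective rate = instructions / (fetch cycles + bubble cycles)
     *                = 1 / (1/width + rate * bubble)                 */
    for (double rate = 0.00; rate <= 0.21; rate += 0.05)
        printf("taken-branch rate %.2f -> %.2f instructions/cycle\n",
               rate, 1.0 / (1.0 / width + rate * bubble));
    return 0;
}
```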

If there is an instruction cache miss, no instructions can be fetched from that thread until the missing cache line becomes available. Similarly, a TLB miss must be serviced before further fetching can be done from the instruction cache.

The µop cache has different constraints on the number of instructions fetched per cycle. Four µops can be read from the µop cache each cycle. This can correspond to one instruction or (with macro-op fusion) more than four instructions. Since the µop cache is virtually addressed, a TLB miss will not stall reads (though a TLB miss would be unlikely given a µop cache hit).

(Four µops can move from the µop Decode Queue to the 60-entry scheduler each cycle.)

With a branch misprediction, since the pipeline is flushed, none of the instructions fetched after the branch will contribute to the count of effective instructions fetched. While instructions will be fetched (and some quite possibly executed) before the branch misprediction is detected, they will not contribute to the number of instructions committed.
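The flush cost is easy to observe with a classic experiment: time the same data-dependent branch over random data (frequent mispredictions) and over uniform data (perfect prediction). This is only a sketch; compile at a low optimization level (e.g. -O1) and check that the compiler kept the branch rather than emitting a conditional move, which would hide the effect.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static long run(const unsigned char *v) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        if (v[i] >= 128)   /* mispredicts ~50% of the time on random data */
            sum += v[i];
    return sum;
}

int main(void) {
    static unsigned char random_v[N], uniform_v[N];
    for (int i = 0; i < N; i++) {
        random_v[i]  = (unsigned char)rand();
        uniform_v[i] = 200;   /* branch always taken: predicts perfectly */
    }
    clock_t t0 = clock(); long a = run(random_v);  clock_t t1 = clock();
    long b = run(uniform_v);                       clock_t t2 = clock();
    printf("random: %fs  uniform: %fs  (sums %ld %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
    return 0;
}
```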

In addition, there is a limited amount of buffering of instructions. If µops are waiting on a load that missed the data cache, the scheduling buffer can fill up; instructions then accumulate in the µop Decode Queue (because that queue is no longer being drained), and the instruction queue just after fetch quickly fills since it cannot drain into the µop Decode Queue.
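One way to observe this back-pressure is a dependent pointer chase through a buffer much larger than the caches: each load's address comes from the previous load's result, so the scheduler fills with stalled µops and the per-load time approaches memory latency. A sketch (sizes arbitrary; error handling and seeding omitted, and rand() is assumed to have a large RAND_MAX):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES ((size_t)1 << 22)   /* 32 MB of pointers: well past LLC */

int main(void) {
    size_t *next = malloc(NODES * sizeof *next);
    size_t *perm = malloc(NODES * sizeof *perm);
    /* Build a shuffled cycle so every load's address depends on the
     * previous load's result and almost always misses the caches. */
    for (size_t i = 0; i < NODES; i++) perm[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)
        next[perm[i]] = perm[(i + 1) % NODES];
    free(perm);

    clock_t t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < NODES; i++)
        p = next[p];              /* serialized: each load waits on the last */
    clock_t t1 = clock();
    printf("%zu, %.2f ns/load\n", p,
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / NODES);
    free(next);
    return 0;
}
```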

The reorder buffer (ROB) places another limit on instructions leaving the µop Decode Queue: when the ROB is full, no more instructions can be moved into the scheduling buffer. This can happen when the oldest instruction has not completed, even if all 191 following instructions have completed and are ready to commit.

Even without data cache misses, dependencies between operations can cause the buffers to fill, leading to the stalling of instruction fetch.
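A minimal sketch of that effect, with no memory misses involved: one serial floating-point multiply chain versus four independent chains doing the same total work. The observed ratio depends on the multiplier's latency and throughput on the machine at hand; compile without aggressive optimization (e.g. -O1) so the chains are not reassociated.

```c
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

int main(void) {
    volatile double seed = 1.000000001;   /* volatile: block constant folding */
    double a = seed, b = seed, c = seed, d = seed, x = seed;

    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        x *= 1.000000001;                 /* one serial multiply chain */
    clock_t t1 = clock();
    for (long i = 0; i < ITERS / 4; i++) {
        a *= 1.000000001; b *= 1.000000001;   /* four independent chains, */
        c *= 1.000000001; d *= 1.000000001;   /* same total multiplies    */
    }
    clock_t t2 = clock();
    printf("serial: %fs  parallel: %fs  (%g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, x, a * b * c * d);
    return 0;
}
```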

As you might guess, having a second thread can facilitate a higher effective instruction fetch rate by reducing the impact of branch mispredictions (effectively, only half the instructions are flushed from the pipeline) and by providing more instruction-level parallelism (since instructions from separate threads are independent), which allows operations to execute and eventually commit, draining the various buffers.

Since there is such substantial buffering of instructions, and since most software does not consistently use the full execution width, there is less pressure to fetch as many instructions as could potentially be executed per cycle. High-quality branch prediction also means that more of the fetched instructions will actually be used. (On a branch misprediction, wider fetch would refill the scheduling buffer more quickly, increasing the chance that independent operations are available. Since multiple threads increase the available instruction-level parallelism, they too provide an incentive for wider fetch, but they also reduce the frequency and cost of fetch stalls, countering that incentive.)

Upvotes: 5
