Koen Van Damme

Reputation: 400

How does a pipelined processor guarantee instruction atomicity so they don't conflict, and so interrupts happen at instruction boundaries?

When a processor executes a single instruction, this can be assumed to be an atomic operation. (Like a database committing a transaction that updates the CPU state, not necessarily in terms of effects on memory visible from other cores.)
So an interrupt or exception happens before or after an instruction, not in the middle.

But how does that work when the processor uses pipelining? The instruction is executed in a number of steps, in parallel with many other instructions, all at different steps. But what if one of those other instructions interferes with ours? How can the processor "roll back" the effects of the instruction, or avoid interference altogether?

Upvotes: 0

Views: 527

Answers (3)

Peter Cordes

Reputation: 364583

Terminology: atomic normally refers to memory access.

Normally in CPU architecture, saying something is "atomic" refers to its effect on memory as seen from other cores (or other threads on this core, especially on a uniprocessor). That's not the case for single instructions in general, especially on CISCs like x86 where a memory-destination add is actually a load, an ALU operation, and a separate store. See Is incrementing an int effectively atomic in specific cases? And some RISCs also allow unaligned loads or stores which can span two cache lines or even pages. Under the hood, those are handled as two separate accesses, so the whole load isn't atomic; with a writer storing the 4-byte values 0xffffffff and 0x00000000 to an address like 0x...ffe, a reader loading from the same address on another CPU core could read 0xffff0000 or 0x0000ffff, a torn value.
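
To make that torn-value example concrete, here's a minimal Python sketch of the hardware behavior (a software model, not real shared memory; the layout mirrors the unaligned example above, with the 4-byte value read as two 2-byte halves):

    memory = bytearray(8)                  # the 4-byte value lives at bytes 2..5

    def store32(value):                    # writer: one logically-whole 4-byte store
        memory[2:6] = value.to_bytes(4, "little")

    def load16(addr):                      # one half of the reader's split load
        return int.from_bytes(memory[addr:addr + 2], "little")

    store32(0xFFFFFFFF)
    low = load16(2)                        # reader's first half: 0xFFFF
    store32(0x00000000)                    # writer's next store lands between the halves
    high = load16(4)                       # reader's second half: 0x0000
    print(hex((high << 16) | low))         # 0xffff, i.e. the torn value 0x0000ffff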


Instruction commit in the execution model for a single core

Yes, in most ISAs^1, the execution model (for a single core) is that each instruction finishes before the next one starts. So interrupts are delivered at some point between two instructions, with all effects of earlier instructions visible, and no effects of later instructions.

So the cardinal rule of pipelined and out-of-order exec is to not break single-threaded programs. A load always has to read the value stored by an earlier store, and the same goes for writes and reads of registers. CPUs track what they're doing to maintain the illusion of running one instruction at a time, in program order.

On simple scalar pipelines, most of the ability to roll back comes from not updating the register file until the last pipeline stage (WB = Write Back). And memory fault detection (e.g. a bad address) happens earlier in the pipeline than actual stores. See https://en.wikipedia.org/wiki/Classic_RISC_pipeline for that and a lot of other details of how to pipeline while maintaining correctness.
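
As a toy model of that: a Python sketch of a 5-stage pipeline where only WB writes the register file, so flushing everything younger when MEM detects a fault leaves architectural state untouched (the structure and names here are invented for illustration):

    regs = [0] * 32                        # architectural register file

    class Instr:
        def __init__(self, dest, value, faults=False):
            self.dest, self.value, self.faults = dest, value, faults

    def run(program):
        stages = [None] * 5                # IF, ID, EX, MEM, WB
        program = list(program)
        while program or any(stages):
            if stages[4]:                  # WB: the only stage that writes regs
                regs[stages[4].dest] = stages[4].value
            if stages[3] and stages[3].faults:   # fault detected in MEM:
                stages[0:4] = [None] * 4         # flush; regs were never touched
                program = []                     # (fetch redirected to the handler)
            stages[1:5] = stages[0:4]            # advance one cycle
            stages[0] = program.pop(0) if program else None

    run([Instr(1, 10), Instr(2, 20, faults=True), Instr(3, 30)])
    print(regs[1], regs[2], regs[3])       # 10 0 0: only the pre-fault instr committed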

See How does a mips 5-stage pipeline cpu handle exception and soft interrupt? for a pretty low-level answer about how they can roll back.

And When an interrupt occurs, what happens to instructions in the pipeline? mentions both in-order and out-of-order exec CPUs, with an answer from Andy Glew who worked at Intel on P6, among other things.

In-order CPUs have to detect "hazards" (see that section of the Classic RISC wiki article), especially read-after-write (RAW) true dependencies. And do bypass-forwarding, e.g. from the EXec stage output to the EXec stage input, to support two back-to-back increments of the same register without stalling in register-read until the previous one reaches write-back (which would also work, but would be a performance disaster).
If they run more than one instruction per clock, write-after-write (WAW) and write-after-read (WAR) hazards are also relevant; likewise if they have some instructions like a multiply or a load that take multiple cycles, and they allow out-of-order completion even though they start instructions in order. Then they also have to detect structural hazards, like a write-back conflict where two instructions would be trying to write the register file in the same cycle: if it's not multi-ported, or if the writes are to the same register, they'd stall the later instruction by a cycle. The stall would happen in an early stage, probably decode, where hazard detection and register fetch or forwarding are done. (That early stage knows how many cycles the already-issued instructions will take, unless they're variable-latency like divide or a load that could miss in cache.)
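
A sketch of that decode-stage check, with forwarding for ALU results and a one-cycle stall for the load-use case (my own illustration of the idea, not any real design):

    # decode-stage hazard check for a scalar 5-stage pipeline (toy model)
    def decide(decoding, in_ex, in_mem):
        """Return 'stall', 'forward', or 'ok' for the instruction in decode."""
        srcs = decoding["srcs"]
        if in_ex and in_ex["op"] == "load" and in_ex["dest"] in srcs:
            return "stall"                 # load result not ready until MEM
        if (in_ex and in_ex["dest"] in srcs) or (in_mem and in_mem["dest"] in srcs):
            return "forward"               # bypass network supplies the value
        return "ok"

    inc = {"op": "add", "dest": 1, "srcs": [1]}        # r1 = r1 + 1
    print(decide(inc, in_ex=inc, in_mem=None))         # forward: back-to-back increments
    ld  = {"op": "load", "dest": 2, "srcs": [3]}
    use = {"op": "add",  "dest": 4, "srcs": [2]}
    print(decide(use, in_ex=ld, in_mem=None))          # stall: load-use hazard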


Out-of-order exec uses a ReOrder Buffer (ROB) and Register Allocation Table (RAT) to keep track of in-flight instructions, and be able to roll back to a consistent state if a fault is detected. (Or when an interrupt arrives, to discard incomplete work instead of waiting for it to finish.) Also a scheduler to keep track of instructions which haven't yet been executed, waiting for their inputs or waiting for an execution unit. (Such an instruction will also have an entry in the ROB.)

(See https://www.realworldtech.com/sandy-bridge/5/ for a block diagram of part of a core. The last page of the article has a diagram of the full core. If I'm talking about part of a CPU that you don't understand, see that and/or Modern Microprocessors A 90-Minute Guide!.)

The ROB is basically a circular buffer: the front-end adds instructions to it in program-order, the retirement stage removes instructions in program-order when they've been marked as having finished executing. This is the point at which they're known to be non-speculative. (So for stores, the corresponding store-buffer entries can be marked as ready to commit to cache and become visible to other cores.) Before that, everything is treated as speculative: any instruction could fault, e.g. divide by 0, load or store to a bad address, or a branch instruction could have been mispredicted so all later instructions in the ROB shouldn't even be there.
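
A minimal Python model of that circular-buffer behavior, including the fault case: retirement stops at the first faulting entry and discards everything younger (all names invented for illustration):

    from collections import deque

    class ROB:
        def __init__(self, size):
            self.entries = deque()         # head = oldest in-flight instruction
            self.size = size

        def allocate(self, instr):         # front-end adds in program order
            if len(self.entries) >= self.size:
                return None                # ROB full: front-end stalls
            entry = {"instr": instr, "done": False, "fault": None}
            self.entries.append(entry)
            return entry

        def retire(self):                  # remove in program order
            retired = []
            while self.entries and self.entries[0]["done"]:
                e = self.entries.popleft()
                if e["fault"]:             # precise exception point:
                    self.entries.clear()   # discard all younger work
                    return retired, e["fault"]
                retired.append(e["instr"]) # now known non-speculative
            return retired, None

    rob = ROB(4)
    a = rob.allocate("add"); d = rob.allocate("div"); s = rob.allocate("store")
    s["done"] = True                       # finished out of order...
    a["done"] = True
    print(rob.retire())                    # (['add'], None): 'store' waits behind 'div'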

Being able to roll back to any ROB entry lets the CPU take an exception exactly at the correct point, on the instruction that caused a page fault for example, so it can be resumed after the OS repairs the situation. It's resumed by having the front-end re-fetch it from that program-counter address, since that's the fault return address the CPU pushed on the stack (or put wherever the ISA lets the OS find it).

("Precise exceptions" is the computer architecture terminology for this feature: out-of-order exec without precise exceptions could only tell you that e.g. an FP exception happened somewhere, but wouldn't be able to resume in a consistent state after e.g. a page fault. (For an external interrupt, a CPU could just wait for all in-flight instructions to finish before handling it, but exceptions like page faults or divide by 0 are synchronous and caused by a specific instruction.) Supporting precise exceptions means you need a ROB as large as your out-of-order exec window (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/), unless you use exotic tricks)

I'm not exactly sure how one builds a RAT that supports these rollbacks. I don't think the mapping from architectural to physical regs is just duplicated for every ROB entry; that wouldn't scale on an ISA like AArch64 with 32 architectural regs (and that's just for integer), especially a high performance implementation with hundreds of physical registers. Maybe they save some info about what changed in the RAT so roll-back can scan back to that point and undo changes? For fast recovery for branch mispredicts, I know Intel CPUs at least have a smaller "branch order buffer" with e.g. 48 entries on Skylake vs. the ROB having hundreds (224 uops in Skylake). I don't know if those do snapshot the RAT.
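
For what it's worth, here's the naive snapshot-everything approach in Python, i.e. exactly the thing that probably doesn't scale if done per ROB entry; a real design presumably checkpoints only at branches (hence a small branch order buffer) or logs incremental changes instead:

    class RAT:
        def __init__(self, n_arch, n_phys):
            self.map = {r: r for r in range(n_arch)}   # arch reg -> phys reg
            self.free = list(range(n_arch, n_phys))    # free physical registers
            self.checkpoints = {}                      # branch id -> full snapshot

        def rename_dest(self, arch_reg):   # every write gets a fresh phys reg
            self.map[arch_reg] = self.free.pop(0)
            return self.map[arch_reg]

        def checkpoint(self, branch_id):   # snapshot taken at a predicted branch
            self.checkpoints[branch_id] = dict(self.map)

        def recover(self, branch_id):      # mispredict: restore older mapping
            self.map = self.checkpoints.pop(branch_id)
            # (freeing the phys regs allocated since the snapshot is omitted)

    rat = RAT(4, 8)
    rat.checkpoint("beq")
    rat.rename_dest(0)                     # speculative write past the branch
    rat.recover("beq")
    print(rat.map[0])                      # 0: mapping is as it was at the branch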

Loads don't have side effects (except on MMIO, which is why you need to map MMIO regions as uncacheable, which also implies no speculative loads), but stores are harder: writing to cache would be visible to other cores, and we can't let any potentially-mis-speculated stuff become visible to other cores. Thus the store buffer, only committing stores to cache after the instructions retire: Can a speculatively executed CPU branch contain opcodes that access RAM?
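
A Python sketch of that: speculative stores sit in the buffer, the core's own later loads are served by store-to-load forwarding, and nothing reaches the (simulated) cache until retirement (names invented for illustration):

    class StoreBuffer:
        def __init__(self):
            self.pending = []              # stores in program order

        def put(self, addr, value):        # at execute: still speculative
            self.pending.append({"addr": addr, "value": value, "retired": False})

        def load(self, addr, cache):       # store-to-load forwarding:
            for e in reversed(self.pending):   # youngest matching store wins
                if e["addr"] == addr:
                    return e["value"]
            return cache.get(addr, 0)

        def commit(self, cache):           # drain oldest retired stores
            while self.pending and self.pending[0]["retired"]:
                e = self.pending.pop(0)
                cache[e["addr"]] = e["value"]  # only now visible to other cores

    cache = {}
    sb = StoreBuffer()
    sb.put(0x10, 42)                       # speculative: not in cache yet
    print(sb.load(0x10, cache), cache)     # 42 {} -- forwarded, still invisible
    sb.pending[0]["retired"] = True        # the store instruction retires
    sb.commit(cache)
    print(cache)                           # {16: 42}: now globally visible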


Footnote 1:
Some ISAs do allow partial progress into one instruction, like m68k. So for example a memory-indirect addressing mode could page-fault on loading the pointer, then page-fault again on loading the pointed-to word, with the second page-fault handler evicting the page that held the pointer. The CPU itself has a buffer of the next steps for the currently-executing instruction, which IIRC context switches can even save/restore. IIRC, m68k can do pre-increment memory-indirect, so if it didn't save partial progress on interrupts, it would have to buffer the store of the updated pointer in case it hit a page fault dereferencing it, and do store-forwarding in case the pointer was pointing at itself.

x86 rep movsb (memcpy of arbitrary length) and other long-running instructions have rules for updating the architectural state with partial progress on interrupts and faults. This is different because there's no hidden state, just register values getting updated with part of the work done. So resume works by just pointing the CPU at the instruction again, and re-fetching and re-decoding it. Same for gathers and scatters; that's one reason they update the mask with partial progress.
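
A sketch of those semantics in Python: since all the progress lives in architectural registers, "resume" is literally just executing the instruction again (the register names follow x86-64; the rest is invented for illustration):

    def rep_movsb(regs, mem, interrupt_after=None):
        copied = 0
        while regs["rcx"] != 0:
            if interrupt_after is not None and copied == interrupt_after:
                return "interrupted"       # RIP still points at the rep movsb
            mem[regs["rdi"]] = mem[regs["rsi"]]
            regs["rsi"] += 1; regs["rdi"] += 1; regs["rcx"] -= 1
            copied += 1
        return "done"

    regs = {"rcx": 4, "rsi": 0, "rdi": 8}
    mem = list(b"abcd") + [0] * 8
    print(rep_movsb(regs, mem, interrupt_after=2))   # interrupted after 2 bytes
    print(rep_movsb(regs, mem))                      # re-run: picks up where it left off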

Upvotes: 0

DannyMeister

Reputation: 1303

There are many strategies employed by various processors, I am sure. I once had a project where I added pipelining to a simulated processor. The techniques I employed were:

  1. Bubbling. For certain operations that have a chance of interfering with later instructions, I know how far back the interference might occur. For example, if a conditional jump evaluation is not complete until the following instruction has already passed through one stage of the pipeline, I might place what is effectively a NOP into the pipeline directly behind the conditional jump.

  2. Forwarding. Under a couple of conditions I was able to see that the current instruction needed a value from an instruction a stage or two ahead of it, but that value had not yet been copied into the register it would normally be read from. In that case, the value is accessed directly from the later pipeline stage.

  3. Branch Prediction and correction. The prediction part isn't so much about how to avoid collisions, but it is important to note. In the case of conditional jumps, you want to make a prediction about what will occur and load the next instruction into the pipeline as early as possible. I always assumed that the condition would evaluate such that a jump is NOT taken, because then I can immediately load the next instruction into the pipeline without first evaluating the jump address, thereby avoiding the need for a bubble.

When the prediction comes true, yay, I am happy. If the prediction does not come true, then I need to negate the effect of the next instruction that we optimistically started early. I did this by switching a signal to the NAND gates within the previous couple of pipeline stages to effectively NOP out the instruction that was currently executing there.
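
Something like this Python sketch of that squash (a software model of the idea; the real mechanism was the gate-level signal described above):

    DEPTH = 3                              # fetch-to-resolve distance (toy value)

    def execute(program, taken_branches):  # taken_branches: instr -> target pc
        pipeline, pc, done = [], 0, []
        while pc < len(program) or pipeline:
            if pc < len(program):          # predict not-taken: fetch fall-through
                pipeline.append(program[pc])
                pc += 1
            if len(pipeline) == DEPTH or pc >= len(program):
                instr = pipeline.pop(0)    # oldest instruction resolves
                if instr in taken_branches:            # prediction was wrong:
                    pipeline = ["nop"] * len(pipeline)  # squash younger work
                    pc = taken_branches[instr]          # redirect fetch
                done.append(instr)
        return done

    print(execute(["i0", "beq", "i2", "i3", "i4"], {"beq": 4}))
    # ['i0', 'beq', 'nop', 'nop', 'i4']: the optimistically fetched i2/i3 became NOPs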

This is what I remember from my only personal experience. I took a look at the Wikipedia page for Instruction Pipeline and saw some of those same ideas present, with far better explanation, I'm sure :) http://en.wikipedia.org/wiki/Instruction_pipeline

Upvotes: 3

Al Kepp

Reputation: 5980

This is defined by the designer of the processor, and it can be different for each particular processor. For example, if we take the typical Intel/AMD x86/x64 processor family, a single instruction is not always atomic.

You must always say what processor type you are talking about. And if it is a different platform than x86/x64, you can probably get a better answer at an electronics forum, not here.

Upvotes: 1
