Olsonist

Reputation: 2403

Statically scheduling OOO processors

The LLVM MISched instruction scheduler uses declarative TableGen descriptions of the processor functional units, pipelines and latencies. Imagine trying to determine the equivalent of the coding guidelines from Intel's Optimization Reference Manual from those declarations.

Broadly speaking, what are the objectives/techniques for statically scheduling OOO processors? When would it schedule an instruction A before B and when would it schedule A after B for an OOO processor?

Superscalar processors can execute more than one instruction at a time. An in-order processor will only consider instructions in their original order. An out-of-order (OOO) processor can execute instructions out of order and then commit the results in order. Speculation doesn't matter for this question but I assume these processors are pipelined. Think A53 (in-order) and Haswell (OOO).

Which instruction an OOO processor will execute next is a scheduling decision made by the processor at run time, so this is usually called dynamic scheduling. Which instruction an in-order processor executes next was decided by the compiler when the program was compiled, so this is usually called static scheduling.

However, compilers also statically target/schedule OOO processors. In both in-order and OOO cases, a compiler can look at a large window of instructions; a compiler has to deal with register pressure; and in both cases, a compiler wants to keep functional units busy. OOO processors typically can also rename registers, reducing register pressure.

Given that an OOO processor schedules instructions dynamically, what should the ahead-of-time compiler do to help it?

Upvotes: 1

Views: 639

Answers (2)

Olsonist

Reputation: 2403

It really isn't a scheduling decision per se but rather an optimization. Looking at the passes that LLVM's addILPOpts() adds for OOO superscalar backends gives a good idea of what is possible. Early if-conversion generates code that will run in parallel and avoids code that must run serially.

LLVM has the EarlyIfConverter pass for OOO superscalars. It's used by the PowerPC, X86, AMDGPU, SystemZ, and AArch64 backends. EarlyIfConverter evaluates both expressions in parallel and inserts a select to choose one: TII->insertSelect(...)

 // Early if-conversion is for out-of-order CPUs that don't have a lot of
 // predicable instructions. The goal is to eliminate conditional branches that
 // may mispredict.
 //
 // Instructions from both sides of the branch are executed speculatively, and a
 // cmov instruction selects the result.

This pass is added by the backend in addILPOpts(). It uses ILP to evaluate the two alternatives in parallel rather than conditionally evaluating one or the other.

Upvotes: 1

yugr

Reputation: 21955

You are generally correct, but compile-time scheduling can still slightly improve execution speed. This happens because the compiler can rearrange instructions in a more optimal way to speed up decoding (older variants of x86 could decode multiple instructions in parallel only if the sequence satisfied certain constraints) or to pack them more tightly into the processor's instruction buffer. To cite Robert Morgan's "Building an Optimizing Compiler":

The compiler should schedule the insns as if the processor were
not an out-of-order execution processor.  The more effective this
schedule is, the larger the size of the effective insns buffer.

The wins are normally quite small in practice (a few percent).

Upvotes: 4
