Weijie Chen

Reputation: 164

Do single-threaded programs execute in parallel in a CPU?

While measuring the CPI (cycles per instruction) of an Intel 4th Gen i5, we got a CPI < 1.

Some of my classmates suggested that it was due to parallel execution of the code, but it was single-threaded C code. The teacher said that processors nowadays are superscalar.

The compilation was done with gcc -m32.

Assume the compiler is not doing magic by parallelizing the code.

But my doubts still remain. Since processors nowadays do some magic with the code, such as out-of-order execution and speculative execution, I wonder about the following case.

Let's say we have these two instructions:

(1) addl %eax, (%eax)
(2) addl %ebx, (%ebx)

Could Core-0 run (1) while Core-1 runs (2)?

Upvotes: 1

Views: 1648

Answers (4)

Peter Cordes

Reputation: 365707

Yes, CPUs find instruction-level parallelism within a single thread to run more than 1 instruction per cycle. See Why does re-initializing a register inside an unrolled ADD loop make it run faster even with more instructions inside the loop? for a specific example.
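
Or, as a quick demo you can compile and measure yourself (a minimal sketch; the file name, iteration count, and build command are illustrative, and availability of Linux perf is assumed): a loop whose body holds two independent dependency chains gives a single core instruction-level parallelism to exploit, so it can retire more than one instruction per cycle:

/* ilp.c - minimal sketch: two independent accumulator chains let a
   superscalar core retire more than one add per cycle (CPI < 1).
   Build:   gcc -m32 -O2 ilp.c -o ilp   (plain gcc -O2 works too)
   Measure: perf stat ./ilp   (the "insn per cycle" line is 1/CPI) */
#include <stdio.h>

int main(void) {
    unsigned a = 0, b = 0;                /* two independent chains */
    for (unsigned i = 0; i < 100000000u; i++) {
        a += i;                           /* chain 1 */
        b ^= i;                           /* chain 2: no dependence on chain 1 */
    }
    printf("%u\n", a + b);                /* keep both results live */
    return 0;
}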

Instruction-level parallelism is unrelated to thread-level parallelism (which you bring up in the 2nd part of your question). When running a single-threaded workload, only one core is active.

Modern multi-core systems exploit both, but you can have one without the other.

For example, Sun's Niagara (UltraSPARC T1) was designed from the ground up to exploit thread-level parallelism (and memory-level parallelism) but not to run any single thread fast, which suits some kinds of server workloads. It has 8 physical single-issue in-order cores, and 4-way SMT (4 logical cores per physical core) instead of OoO exec to hide latency (e.g. cache misses). Single-threaded performance is crap, but max throughput running 32 threads was good for 2005, given its power budget and transistor count.

Earlier x86 CPUs, like Pentium III, were superscalar but single-core; only multi-socket systems were SMP. Such CPUs could and did achieve CPI < 1.

Your i5 4th-gen CPU is Haswell. See David Kanter's deep dive on Haswell's microarchitecture, including block diagrams of how wide various stages are inside each core.

Do processors run single-threaded programs on multiple cores?

NO, a single core is superscalar on its own, e.g. having 4 integer ALU execution units in Haswell or Zen. (And on Intel CPUs, 3 SIMD ALU execution units which are on the same execution ports as the scalar / general-purpose integer ALUs.) And a front-end wide enough to match.

In general, superscalar CPUs are capable of running at least 2 instructions per clock per core.

This wrong guess in your question is a duplicate of How does a single thread run on multiple cores? on programmers.SE. Answer: they don't; each core has a wide front-end and multiple execution units in the back-end.

Until we hit diminishing returns on making a single core wider, CPU designers kept doing that instead of building multi-core CPUs; time slicing / pre-emptive multitasking was generally good enough. One faster core is better than N cores of 1/N speed for almost everything. But these days that's not the tradeoff; it's N cores of 1/sqrt(N) speed or something like that.


  • addl %eax, (%eax)
  • addl %ebx, (%ebx)

These memory-destination add instructions take more than 1 cycle each to complete (and decode to at least 2 uops each on modern Intel: a micro-fused load+add, and a store (micro-fused store-address + store-data)). The load+add parts can both start in the same cycle if they run on the same physical core.

Ice Lake can also execute both stores in the same cycle, but earlier modern x86 CPUs only do 1 store per clock. (E.g. Intel from Haswell through Coffee Lake can do 2 loads + 1 store per clock cycle. SnB/IvB can do address-generation for 2 memory operations per cycle, and sustain that throughput if at most one of them is a store, with a special 2+1 case for 256-bit vectors that reuse the same address-generation for 2 cycles of data.)

Unless EAX and EBX hold the same pointer value, these instructions access different memory and different registers, and are fully independent except for resource conflicts for execution units (load, add, store). (Register renaming handles the write-after-write hazard for the FLAGS output.)
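
In C terms, the pair behaves like two independent read-modify-writes; here is a loose analogue (the function and pointer names are hypothetical, not from the question; addl %eax, (%eax) adds the register to the memory it points at):

#include <stdint.h>

/* Unless p and q alias (the EAX == EBX case), these read-modify-writes
   touch different memory and different registers, so one core can
   overlap their load+add uops even though each takes several cycles. */
void rmw_pair(int32_t *p, int32_t *q) {
    *p += (int32_t)(uintptr_t)p;   /* like addl %eax, (%eax) */
    *q += (int32_t)(uintptr_t)q;   /* like addl %ebx, (%ebx) */
}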

Upvotes: 6

instinct71

Reputation: 379

Superscalar processors have the ability to fetch/decode/execute many instructions at the same time. This is achieved by provisioning enough hardware resources to handle several instructions at once. For example, the execute stage will have multiple ALUs, etc.

Upvotes: 3

Yes. A very large part of the CPU is dedicated to so-called scheduling, which assigns work to the internal resources of the CPU. Whenever this scheduling circuitry can prove that two instructions do not collide in the resources they require (functional units like the different ALUs and, most importantly, registers), those instructions may be scheduled in parallel. This can lead to a CPI of less than one.

Typical instructions that are independent of each other are control flow (branches), integer arithmetic, and floating point arithmetic. Especially the latter two are virtually always independent, because they require very different ALUs and operate on data of different types. So, when your program does, for instance,

double a = 7.0, factor = 1.1;
for(int i = 42; i--; ) a *= factor;

you may find that the floating point circuitry is performing the multiplication at the same time as the integer circuitry decrements and checks the loop counter, while the control flow circuitry executes the branch to the next loop iteration. The integer and branch work hides under the floating point multiply, so such loops can run at essentially the cost of the multiply chain alone per iteration...
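
A variant makes the effect measurable (a sketch; the second accumulator b is my addition, not part of the example above): with two independent products, the pipelined floating point multiplier can overlap the two chains instead of waiting on a single one:

double a = 7.0, b = 3.0, factor = 1.1;
for (int i = 42; i--; ) {
    a *= factor;   /* chain 1 */
    b *= factor;   /* chain 2: independent, overlaps with chain 1 */
}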


I've chosen the example of different kinds of instructions, because it makes it easy to understand that you need different resources for each instruction. However, modern CPUs typically contain more than one copy of key resources. They contain quite a number of ALUs capable of integer addition/subtraction, for instance, and use a complex scheme of register renaming to make use of many more physical registers than there are registers visible to the assembler programmer. This allows them to execute two independent instructions in parallel, even when they are of the same type (integer addition, for example) and formally operate on the same registers.
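
As a sketch of why renaming matters (an illustrative function of my own, assuming the compiler keeps t in a single architectural register):

#include <stddef.h>

/* Every iteration formally overwrites the same variable t, i.e. after
   compilation typically the same architectural register (a false
   write-after-write dependence). Renaming onto the larger physical
   register file removes it, so multiplies from several iterations can
   be in flight at once; only the cheap sum chain is truly serial. */
int sum_scaled(const int *a, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        int t = a[i] * 3;   /* new, independent value of t each iteration */
        sum += t;
    }
    return sum;
}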

Basically, you can think of the CPU front-end as a just-in-time compiler that translates the machine code into an internal instruction set, including an optimizer that tries to keep as many of the CPU's resources busy as possible.

Upvotes: 3

Jesse Williams

Reputation: 662

It's important to understand the distinction between parallelization and concurrency. While the terms are often used interchangeably, they are actually different.

Prior to the existence of multi-core processors and multi-processor systems, concurrency was achieved via time slicing, which gave the appearance of parallel processing by executing code serially in adjacent slices of time on the CPU's clock. The end result was that things would complete at roughly the same time and appear to have been done simultaneously. It's worth noting that concurrency is still used quite a bit.

Parallelization instead runs multiple threads and allows work to be done asynchronously on various cores; the results can later be recombined (more or less) to provide the expected results or feedback in the UI, actions in a game, et cetera.

Some modern compilers and CPU/GPU instruction sets can parallelize things that aren't explicitly parallel in code. Also, some benchmarking may over- or undervalue the threading capabilities of a given core or processor.

Upvotes: -1
