Christopher

Reputation: 799

Does the CPU actually execute one instruction before another when re-ordering, or is it only the end result that gives this "illusion"?

Based on what I have read, a CPU can re-order the execution of instructions, and a memory barrier prevents instructions from being re-ordered across it in either direction (from before the barrier to after it, or from after it to before).

But there is something that I am not sure of. Say I have the following instructions:

store x
store y

Let's say that the CPU decided to execute store y before store x.

How does the CPU do that? Does it completely ignore store x and execute store y first? Or does the following happen?:

  1. store x is executed, but it is not completed immediately (it becomes pending).
  2. store y is executed, and it is completed immediately.
  3. The pending store x is completed.

So basically, this gives the "illusion" that the instructions were executed out of order even though they were not; they only completed out of order.


I am asking this question to understand how a memory barrier works.

For example say I have the following instructions:

store x
mfence
store y

Now when the CPU executes these instructions, will the following happen?:

  1. store x is executed, but it is not completed immediately (it becomes pending).
  2. mfence is executed, now since this instruction is a memory barrier, the CPU will make sure that all pending operations before it (store x) will be completed before continuing with the execution of instructions.
  3. store y is executed.

Upvotes: 1

Views: 465

Answers (2)

fuz

Reputation: 92966

On a super-scalar processor, you can have operations queued up waiting for previous instructions to complete. Imagine code like this:

...
div %esi        # divide edx:eax by esi
mov %eax,(%ebx) # store quotient in (%ebx)
movl $1,(%ecx)  # store 1 in (%ecx); the l suffix gives the operand size, which no register operand implies here

On a super-scalar processor, the first mov instruction will be encountered right after the div instruction is dispatched. However, at that time div hasn't finished yet. Thus the store instruction is queued in the instruction queue until the result of div %esi is available in %eax. In the next cycle, the processor encounters movl $1,(%ecx). Since the immediate $1 is immediately available, the processor doesn't have to wait and can execute the store right away. Some time after that store has been dispatched, the div instruction finishes, causing the first store to be released from the instruction queue and executed.

This is how it happens that stores occur in a different order than the machine code specifies. The CPU has extra logic to ensure that this detail isn't usually visible to the programmer, but depending on what architecture you program for, different artifacts can exist.
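
The queuing described above can be mimicked with a very small model. This is only a sketch under loud assumptions: one instruction is dispatched per cycle, the names `div`, `store_quotient` and `store_one` are labels invented for the three instructions, and the 20-cycle `div` latency is a made-up figure. It models only when each instruction starts executing, not when a store becomes visible to other cores.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    reads: tuple    # registers that must be ready before execution can start
    writes: tuple   # registers produced by the instruction
    latency: int    # cycles from start of execution until the results are ready

def execution_start_cycles(program):
    """Dispatch one instruction per cycle in program order. Each instruction
    starts executing once it is dispatched and all its source registers are ready."""
    ready_at = {}   # register -> cycle at which its latest value is available
    start = {}
    for dispatch_cycle, ins in enumerate(program):
        begin = max([dispatch_cycle] + [ready_at.get(r, 0) for r in ins.reads])
        start[ins.name] = begin
        for r in ins.writes:
            ready_at[r] = begin + ins.latency
    return start

program = [
    Instr("div", ("eax", "edx", "esi"), ("eax", "edx"), 20),  # assumed latency
    Instr("store_quotient", ("eax", "ebx"), (), 1),           # mov %eax,(%ebx)
    Instr("store_one", ("ecx",), (), 1),                      # store of $1 to (%ecx)
]

start = execution_start_cycles(program)
for name, cycle in sorted(start.items(), key=lambda kv: kv[1]):
    print(f"cycle {cycle:2d}: {name}")
```

In this model `store_one` starts in cycle 2 while `store_quotient` has to wait until cycle 20 for `%eax`, so the later store in program order executes first.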

Upvotes: 0

Johan

Reputation: 76537

mfence does not prevent out-of-order execution.
It merely ensures that all memory loads and stores preceding the mfence are serialized before any memory loads or stores after the mfence are executed.

See: http://x86.renejeschke.de/html/file_module_x86_id_170.html

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible.

x86 is limited in OoO memory accesses in any case
The x86 architecture does have a few memory ordering rules built-in already.
The gist of this is that memory accesses receive very little reordering.

Here's the official write-up from Intel: http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf

The gist has been most helpfully listed in the index :-)

Memory ordering for write-back (WB) memory
* Loads are not reordered with other loads and stores are not reordered with other stores
* Stores are not reordered with older loads
* Loads may be reordered with older stores to different locations
[...]
* Loads and stores are not reordered with locks
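
To make these rules a bit more concrete, here is a toy model, not real hardware: each core gets a FIFO store buffer, a store first goes into the buffer and reaches memory later, a load checks its own buffer before memory, and `mfence` may only execute once the core's buffer is empty. The function names (`outcomes`, `sb_outcomes`) and the tuple encoding of instructions are made up for this sketch. It enumerates every possible interleaving of the classic test where core 0 does `store x; load y` and core 1 does `store y; load x`.

```python
def replace(tup, i, value):
    """Return a copy of tuple `tup` with element i replaced."""
    return tup[:i] + (value,) + tup[i + 1:]

def outcomes(programs):
    """Enumerate every final register assignment reachable in the toy model."""
    results, seen = set(), set()

    def explore(pcs, buffers, memory, regs):
        state = (pcs, buffers, memory, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(p) for pc, p in zip(pcs, programs)) and not any(buffers):
            results.add(regs)
            return
        for t, prog in enumerate(programs):
            # Option 1: the oldest buffered store of core t reaches memory.
            if buffers[t]:
                addr, val = buffers[t][0]
                new_mem = tuple(sorted({**dict(memory), addr: val}.items()))
                explore(pcs, replace(buffers, t, buffers[t][1:]), new_mem, regs)
            # Option 2: core t executes its next instruction.
            if pcs[t] < len(prog):
                op = prog[pcs[t]]
                next_pcs = replace(pcs, t, pcs[t] + 1)
                if op[0] == "store":
                    buffered = buffers[t] + ((op[1], op[2]),)
                    explore(next_pcs, replace(buffers, t, buffered), memory, regs)
                elif op[0] == "load":
                    # The newest matching store in the core's own buffer wins.
                    pending = [v for a, v in buffers[t] if a == op[1]]
                    val = pending[-1] if pending else dict(memory)[op[1]]
                    new_regs = tuple(sorted({**dict(regs), op[2]: val}.items()))
                    explore(next_pcs, buffers, memory, new_regs)
                elif op[0] == "mfence" and not buffers[t]:
                    # The fence can only execute once the buffer has drained.
                    explore(next_pcs, buffers, memory, regs)

    start_buffers = ((),) * len(programs)
    explore((0,) * len(programs), start_buffers, (("x", 0), ("y", 0)), ())
    return results

def sb_outcomes(fenced):
    """Return the set of (r0, r1) results, with or without mfence."""
    fence = [("mfence",)] if fenced else []
    core0 = [("store", "x", 1)] + fence + [("load", "y", "r0")]
    core1 = [("store", "y", 1)] + fence + [("load", "x", "r1")]
    return {(dict(r)["r0"], dict(r)["r1"]) for r in outcomes([core0, core1])}

print("without mfence:", sorted(sb_outcomes(False)))
print("with mfence:   ", sorted(sb_outcomes(True)))
```

Without the fence the model reaches `(0, 0)`, i.e. both loads completed before the older store of the other core became visible ("loads may be reordered with older stores to different locations"). With the fence that result disappears. Because each buffer is FIFO, two stores from the same core always reach memory in program order in this model.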

Back to your questions

Does the CPU actually execute one instruction before another when re-ordering
Yes, you can see this when timing the code.

Let me give you an example. Let's assume we have an AMD Jaguar, which can execute 2 instructions in parallel and has full OoO.

a: mov ebx,[eax]      //1 cycle throughput
b: mov ecx,2          //pairs
c: imul eax,edx       //3 cycles latency
d: add eax,ebp        //1 cycle, needs to wait for c

Normally this snippet would take 1+3+1 = 5 cycles. However, the CPU will execute this in the following order:

c: imul eax,edx      //3 cycle latency
a: mov ebx,[eax']    //pairs, eax is renamed to eax' in the register rename buffer
b: mov ecx,2         //1 cycle
d: add eax,ebp       //1 cycle waits for c

This only takes 4 cycles: 3 for c and 1 for d; all the rest gets interleaved.
There obviously is space to squeeze more instructions between c and d and the CPU will do so if it has any instructions that are applicable.
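
The cycle counts above can be reproduced with a toy 2-wide scheduler. This is a sketch under stated assumptions: the `schedule` function and its tuple format are made up for illustration, register renaming is assumed to leave only the dependency of d on c, and the out-of-order picker simply prefers the ready instruction with the longest latency, which is a heuristic chosen here and not a description of what a Jaguar really does.

```python
def schedule(program, width, in_order):
    """Return (issue order, total cycles) for a toy `width`-wide machine.

    program: list of (name, latency, names of instructions it depends on).
    """
    done_at = {}              # name -> cycle at which its result is available
    waiting = list(program)
    order = []
    cycle = 0
    while waiting:
        ready = [ins for ins in waiting
                 if all(done_at.get(dep, float("inf")) <= cycle for dep in ins[2])]
        if in_order:
            # In-order issue: only a leading run of ready instructions may go.
            n = 0
            while n < len(waiting) and waiting[n] in ready:
                n += 1
            picked = waiting[:n][:width]
        else:
            # Out-of-order issue: any ready instruction may go.
            # Preferring long latencies is an assumed heuristic.
            picked = sorted(ready, key=lambda ins: -ins[1])[:width]
        for ins in picked:
            done_at[ins[0]] = cycle + ins[1]
            order.append(ins[0])
            waiting.remove(ins)
        cycle += 1
    return order, max(done_at.values())

program = [
    ("a", 1, ()),       # mov ebx,[eax]
    ("b", 1, ()),       # mov ecx,2
    ("c", 3, ()),       # imul eax,edx
    ("d", 1, ("c",)),   # add eax,ebp, needs the result of c
]

print(schedule(program, width=2, in_order=True))
print(schedule(program, width=2, in_order=False))
```

With these assumptions the in-order run issues a, b, c, d and finishes after 5 cycles, while the out-of-order run issues c, a, b, d and finishes after 4.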

Note that the CPU reorders a memory load here, which is allowed as long as it is not reordered relative to another memory load (and subject to a few other restrictions, see above).
Also note that AMD and Intel follow the exact same semantics.

Upvotes: 3
