BlenderBender

Reputation: 566

Why is `mov %eax, %eax; nop` faster than `nop`?

Apparently, modern processors can tell if you do something stupid like moving a register to itself (`mov %eax, %eax`) and optimize that out. Trying to verify that claim, I ran the following program:

#include <stdio.h>
#include <time.h>

static inline void f1() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "mov %eax, %eax;"
            "nop;"
            );
}

static inline void f2() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "nop;"
            );
}

static inline void f3() {
   for (int i = 0; i < 100000000; i++)
      __asm__(
            "mov %ebx, %eax;"
            "nop;"
            );
}

int main() {
   int NRUNS = 10;
   clock_t t, t1, t2, t3;

   t1 = t2 = t3 = 0;
   for (int run = 0; run < NRUNS; run++) {
      t = clock(); f1(); t1 += clock()-t;
      t = clock(); f2(); t2 += clock()-t;
      t = clock(); f3(); t3 += clock()-t;
   }

   printf("f1() took %f cycles on avg\n", (float) t1/ (float) NRUNS);
   printf("f2() took %f cycles on avg\n", (float) t2/ (float) NRUNS);
   printf("f3() took %f cycles on avg\n", (float) t3/ (float) NRUNS);

   return 0;
}

This gives me:

f1() took 175587.093750 cycles on avg
f2() took 188313.906250 cycles on avg
f3() took 194654.296875 cycles on avg

As one would expect, f3() comes out slowest. But surprisingly (to me at least), f1() is faster than f2(). Why is that?

Update: Compiling with -falign-loops gives qualitatively the same result:

f1() took 164271.000000 cycles on avg
f2() took 173783.296875 cycles on avg
f3() took 177765.203125 cycles on avg
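
For completeness, the builds are unoptimized and look roughly like this (the file name is just a placeholder):

gcc bench.c -o bench                  # baseline timings
gcc -falign-loops bench.c -o bench    # timings for this update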

Upvotes: 5

Views: 794

Answers (1)

Peter Cordes

Reputation: 364532

> The part of the linked article that made me think that this can be optimized away is: "the move function takes care of checking for equivalent locations"

That's talking about the `(move r x)` function in SBCL, not the x86 `mov` instruction. It's describing an optimization during code generation from that low-level intermediate language, not something the hardware does at runtime.

Neither `mov %eax, %eax` nor `nop` is totally free. They both cost front-end throughput, and `mov %eax, %eax` isn't even a NOP in 64-bit mode: it zero-extends EAX into RAX, and because source and destination are the same register, mov-elimination fails on Intel CPUs.

See *Can x86's MOV really be "free"? Why can't I reproduce this at all?* for more about front-end / back-end throughput bottlenecks vs. latency.
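
As a concrete illustration of the zero-extension point, here is a minimal test (my own sketch, not code from the question; the %k operand modifier asks GCC for the 32-bit name of whatever register holds x):

#include <stdint.h>
#include <stdio.h>

int main() {
   uint64_t x = 0xdeadbeefcafebabeULL;
   // Writing a 32-bit register zero-extends into the full 64-bit register,
   // so this same-register mov clears the upper 32 bits of x.
   __asm__("mov %k0, %k0" : "+r" (x));
   printf("%#llx\n", (unsigned long long) x);   // prints 0xcafebabe
   return 0;
}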


You're probably seeing some side effect of code alignment, or maybe a funky Sandybridge-family store-forwarding latency effect like in *Adding a redundant assignment speeds up code when compiled without optimization*, because you also compiled with optimization disabled. That gets the compiler to make anti-optimized code for consistent debugging, keeping the loop counter in memory: a ~6-cycle loop-carried dependency chain through store/reload of `i` instead of 1 iteration per clock for a normal tiny loop.
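
If the goal is to time just the instructions under test, one way to take that store/reload out of the picture is to keep the loop counter inside the asm statement, so it stays in a register even at -O0. A rough sketch (my own example with an assumed function name; note the explicit clobbers):

static inline void f1_reg() {
   unsigned n = 100000000;
   __asm__ volatile(
         "1: mov %%eax, %%eax;"
         "nop;"
         "dec %0;"
         "jnz 1b;"
         : "+r" (n)
         :
         : "eax", "cc");
}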

If your results are reproducible with a larger iteration count, there's probably some microarchitectural explanation for what you're seeing, but it's probably not related to anything you were trying to measure.

Of course, you'd also need to fix the `mov %ebx, %eax` bug in f3 before it could work correctly with optimization enabled: clobbering EAX without telling the compiler will step on compiler-generated code. You didn't explain what you were trying to test with that, so I don't know whether it was just a typo.
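
If clobbering another register really is what you want to test, extended asm lets you declare it so the compiler's own code stays correct; a minimal sketch (again my own example, not the question's code):

static inline void f3_fixed() {
   for (int i = 0; i < 100000000; i++)
      // Listing "eax" as a clobber tells the compiler that nothing it cares
      // about survives in EAX across this statement; EBX is only read here.
      __asm__ volatile(
            "mov %%ebx, %%eax;"
            "nop;"
            : : : "eax");
}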

Upvotes: 3
