fraiser

Reputation: 969

Multiple nop instructions do not consistently take longer than a single nop instruction

I am timing multiple NOP instructions and a single NOP instruction in C++, using rdtsc. However, the number of cycles measured does not increase in proportion to the number of NOPs executed. I'm confused as to why this is the case. My CPU is an Intel Core i7-5600U @ 2.60GHz.

Here's the code:

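#include <stdio.h>
#include <x86intrin.h>  // declares __rdtsc() on GCC/Clang

int main() {
    unsigned long long t;

    // Time a single NOP between two RDTSC reads.
    t = __rdtsc();
    asm volatile("nop");
    t = __rdtsc() - t;
    printf("rdtsc for one NOP: %llu\n", t);

    // Time seven NOPs between two RDTSC reads.
    t = __rdtsc();
    asm volatile("nop; nop; nop; nop; nop; nop; nop;");
    t = __rdtsc() - t;
    printf("rdtsc for seven NOPs: %llu\n", t);

}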

I am getting values like:

rdtsc for one NOP: 78
rdtsc for seven NOPs: 91

rdtsc for one NOP: 78
rdtsc for seven NOPs: 78

when running without setting processor affinity. When setting processor affinity with taskset -c 0 ./nop, the results are:

rdtsc for one NOP: 78
rdtsc for seven NOPs: 78

rdtsc for one NOP: 130
rdtsc for seven NOPs: 169

rdtsc for one NOP: 78
rdtsc for seven NOPs: 143

Why would this be the case?

Upvotes: 5

Views: 1798

Answers (1)

Peter Cordes

Reputation: 364128

Your results here are probably measurement noise and/or frequency scaling, since you start the timer for the 2nd interval right after printf returns from making a system call.

RDTSC counts reference cycles, not core clock cycles, so you're mostly just discovering the CPU frequency. (Lower core clock speed = more reference cycles for the same number of core clocks to run two rdtsc instructions.) Your RDTSC instructions are basically back-to-back; the nop instructions are negligible compared to the number of uops that rdtsc itself decodes to (on normal CPUs including your Broadwell).
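One way to see this for yourself is to measure the empty interval (two RDTSC reads with nothing in between) and compare it with the seven-NOP interval, taking the minimum over many runs to filter out noise. This is only a rough sketch: the function names and the iteration count are made up for illustration, and it assumes GCC or Clang on x86-64.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc on GCC/Clang (MSVC would use <intrin.h>)

// Smallest TSC delta seen over `iters` pairs of back-to-back RDTSC reads.
// This is the baseline cost of the measurement itself.
static uint64_t min_empty_delta(int iters) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iters; ++i) {
        uint64_t start = __rdtsc();
        uint64_t stop = __rdtsc();
        best = std::min(best, stop - start);
    }
    return best;
}

// Same measurement, but with seven NOPs between the two reads.
static uint64_t min_seven_nop_delta(int iters) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iters; ++i) {
        uint64_t start = __rdtsc();
        asm volatile("nop; nop; nop; nop; nop; nop; nop");
        uint64_t stop = __rdtsc();
        best = std::min(best, stop - start);
    }
    return best;
}

int main() {
    const int iters = 100000;  // arbitrary; enough to warm up and hit the minimum
    printf("min delta, empty:      %llu\n",
           (unsigned long long)min_empty_delta(iters));
    printf("min delta, seven NOPs: %llu\n",
           (unsigned long long)min_seven_nop_delta(iters));
}

If the two minimums come out nearly equal, that is consistent with the NOPs being lost in the cost of RDTSC itself. The deltas are still in reference cycles, so they scale with whatever core frequency the CPU happens to be running at.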

Also RDTSC can be reordered by out-of-order execution. Not that nop does anything that the CPU would have to wait for; it's just delaying the front-end by 0.25 or 1.75 cycles from issuing the uops of the 2nd rdtsc. (Actually I'm not sure if the microcode sequencer can send uops in the same cycle as a uop from another instruction. So maybe 1 or 2 cycles).
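If you want to stop that reordering when measuring, a common approach is to put lfence around each rdtsc. The sketch below assumes Intel's documented LFENCE behaviour (later instructions don't start until earlier ones have completed; on AMD this depends on how the system is configured), and fenced_rdtsc is just a name I picked. The fences add their own overhead, so this makes the measurement more repeatable, not cheaper.

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang, x86-64)

// Read the TSC with LFENCE on both sides so that earlier instructions have
// completed before the read, and later ones don't start until after it.
static inline uint64_t fenced_rdtsc() {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

int main() {
    uint64_t start = fenced_rdtsc();
    asm volatile("nop; nop; nop; nop; nop; nop; nop");
    uint64_t stop = fenced_rdtsc();
    // Still reference cycles, and still dominated by RDTSC + LFENCE overhead.
    printf("fenced delta for seven NOPs: %llu\n",
           (unsigned long long)(stop - start));
}

Even with the fences, a handful of NOPs is far below the resolution of this kind of timing; you would need a long repeated sequence to see them at all.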

My answer on How to get the CPU cycle count in x86_64 from C++? has a bunch of background on how RDTSC works.


You might want the pause instruction. It idles for ~100 core clock cycles on Skylake and later, or ~5 cycles on earlier Intel cores. Or spin on PAUSE + RDTSC. How to calculate time for an asm delay loop on x86 linux? shows a possibly-useful delay spinloop that sleeps for a given number of RDTSC counts. You need to know the reference clock speed to correlate that with nanoseconds, but it's typically around the rated max non-turbo clock on Intel CPUs. e.g. 4008 MHz on a 4.0GHz Skylake.
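A minimal sketch of the PAUSE + RDTSC spin idea, to make it concrete. This is not the code from the linked answer; the function names are hypothetical, and kAssumedTscHz is an assumed reference frequency (2.6 GHz, matching the rated clock of the CPU in the question) that you would need to replace with the real value for your machine.

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc, _mm_pause (GCC/Clang, x86-64)

// ASSUMPTION: TSC reference frequency in Hz. 2.6e9 is a placeholder based on
// the rated non-turbo clock; the true value differs per machine.
constexpr double kAssumedTscHz = 2.6e9;

// Convert a duration in nanoseconds to TSC ticks under the assumed frequency.
static uint64_t ns_to_tsc_ticks(double ns) {
    return static_cast<uint64_t>(ns * kAssumedTscHz / 1e9);
}

// Spin until at least `ticks` TSC counts have elapsed; returns the actual
// number of ticks spent, which is always >= the request.
static uint64_t spin_for_tsc_ticks(uint64_t ticks) {
    uint64_t start = __rdtsc();
    uint64_t now = start;
    while (now - start < ticks) {
        _mm_pause();  // be polite to the other hyperthread while spinning
        now = __rdtsc();
    }
    return now - start;
}

int main() {
    uint64_t want = ns_to_tsc_ticks(100.0);  // ~100 ns under the assumption
    uint64_t got = spin_for_tsc_ticks(want);
    printf("requested %llu ticks, spun for %llu ticks\n",
           (unsigned long long)want, (unsigned long long)got);
}

The loop always overshoots by some amount, since the granularity is one PAUSE plus one RDTSC per iteration, and an interrupt or context switch during the spin can make it overshoot by a lot.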

If available, tpause takes a TSC timestamp as the wake-up time. (See the link). But for now it's only available on low-power Tremont.


Inserting NOPs is never going to work reliably on modern superscalar / out-of-order x86 with huge reorder buffers! Modern x86 isn't a microcontroller where you can calculate iterations for a nested delay loop. If surrounding code doesn't bottleneck on the front-end, OoO exec is just going to hide the cost of feeding your NOPs through the pipeline.

Instructions don't have a cost you can just add up. To model the cost of an instruction, you need to know its latency, front-end uop count, and which back-end execution ports it needs, plus any special effects on the pipeline, like lfence waiting for all previous uops to retire before later ones can issue. See How many CPU cycles are needed for each assembly instruction?

See also What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?


Note that your desired "sleep" time of ~100ns isn't necessarily even long enough to drain the out-of-order execution buffer (the ROB) if there are cache misses in flight, or possibly even a very slow ALU dependency chain. (The latter is unlikely outside of artificial cases). So you probably don't want to do anything like lfence.

Upvotes: 6
