Noah

Reputation: 1759

Why does a NOP (as a 5th uop) speed up a 4 uop loop on Ice Lake?

All benchmarks were done on Ice Lake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark).

Edit: I was not able to reproduce this on Broadwell, and @PeterCordes was unable to reproduce it on Skylake.

I was trying to benchmark different methods of doing integer min(a, b) but ran into some unexplained behavior that I've boiled down to the following benchmark:

#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))

#define SIX_BYTES_COMPUTATION 1
#define WITH_NOP_BEFORE_DECL  0
#define BREAK_DEPENDENCY      0
void BENCH_FUNC_ATTR
bench() {
    uint64_t       start, end;
    const uint64_t N = 1000000;
    start            = _rdtsc();
    uint64_t v0, v1, dst, loop_cnt;
    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "movl $1, %k[v1]\n\t"
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt], %k[v0]\n\t"
        "xorl %k[loop_cnt], %k[v1]\n\t"
        "movl %k[v0], %k[dst]\n\t"
#else
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#endif
        ".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
        "nop\n\t"
#endif
#if BREAK_DEPENDENCY
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#endif
        // macro-fusion is NOT broken
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc", "memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf(
        "SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
        "BREAK_DEPENDENCY - [%s]\n\t",
        SIX_BYTES_COMPUTATION ? "ON" : "OFF",
        WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");

    printf("%.3lf \"Cycles\"\n", dif);
}

Turning on WITH_NOP_BEFORE_DECL, so that there is a nop before the decl + jnz, causes a measurable performance improvement when SIX_BYTES_COMPUTATION is turned on, but a measurable performance degradation when SIX_BYTES_COMPUTATION is turned off.

Here are the numbers:

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    2.080 "Cycles" <--- Just 6 nops

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    2.363 "Cycles" <--- Performance degradation from previous

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    2.185 "Cycles" <--- Computation then decl + jnz

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    1.945 "Cycles" <--- Performance improvement from previous

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    1.919 "Cycles" <--- Breaking dependencies has best performance

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
    2.046 "Cycles" <--- nop hurts performance when breaking dependencies

Might it have to do with the register file filling up? I was able to find one potentially interesting metric, uops_issued.stall_cycles [Cycles when RAT does not issue Uops to RS for the thread], which has the following outputs:

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    473,647      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    495,380      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    1,406,244      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    875,364      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    647,297      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
    501,015      uops_issued.stall_cycles

The stall cycles do seem to track whether SIX_BYTES_COMPUTATION and WITH_NOP_BEFORE_DECL are on or off, but I'm not sure why a nop would save room in the register file.
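For reference, I'm collecting these counters (and the per-port numbers further down) with perf stat driven from Python, in the same style as the run script near the bottom of the question; the binary name here is just a placeholder:

import os

# Hypothetical binary name -- substitute whatever the benchmark was compiled to.
BIN = "./bench"

# Issue-stall and per-port dispatch events used throughout this question.
EVENTS = [
    "uops_issued.stall_cycles",
    "uops_dispatched.port_0",
    "uops_dispatched.port_1",
    "uops_dispatched.port_5",
    "uops_dispatched.port_6",
]

os.system("perf stat " + " ".join("-e " + e for e in EVENTS) + " " + BIN)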

I'm pretty sure it's not an alignment issue: with the .p2align 4 between the first 6 bytes of the loop body and the decl + jnz, the decl + jnz ends up in a different 16-byte region either way, and the performance difference reverses depending on what is in the loop body (if it were an alignment thing, it shouldn't matter whether the loop body is nops or computation).
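(To double-check the layout, I just disassemble the compiled benchmark and look at where the loop instructions land relative to the 16/64-byte boundaries. The binary name and the mangled symbol below are guesses; adjust them to whatever your build produces:)

import os

# Hypothetical binary / symbol names -- _Z5benchv is the usual mangling of bench().
BIN = "./bench"
os.system("objdump -d " + BIN + " | grep -A 24 '<_Z5benchv>:'")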


I was thinking it might have to do with some dependency-chain issue, but if I break the dependency on v0 and v1 at the end of the loop, then turning WITH_NOP_BEFORE_DECL on causes a performance degradation instead. I'm very likely wrong about that though, because I have no idea why a nop before the end of the loop would affect any dependency issue.


It almost definitely does not have to do with port scheduling. I was thinking maybe something weird was going on where the nop by chance led to better scheduling, but I didn't see any meaningful difference in uops dispatched on ports 0, 1, 5, and 6 with WITH_NOP_BEFORE_DECL off or on:

Uops per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL off:

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF]
         1,147,196      uops_dispatched.port_0
         1,114,665      uops_dispatched.port_1
         1,138,238      uops_dispatched.port_5
         1,266,212      uops_dispatched.port_6

Uops per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL on:

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON]
         1,177,092      uops_dispatched.port_0
         1,081,734      uops_dispatched.port_1
         1,103,314      uops_dispatched.port_5
         1,296,546      uops_dispatched.port_6


My leading theory is that there is some inefficiency in the register-renaming process that is the bottleneck without the nop, and that by luck the nop hides that issue, but I am not at all confident in that.

Can anyone help me understand this behavior?

Edit: Full cpp code and new times w/ warmup and lfence before rdtsc.


New Code

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>

#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))

#ifndef SIX_BYTES_COMPUTATION
#define SIX_BYTES_COMPUTATION 0
#endif
#ifndef WITH_NOP_BEFORE_DECL
#define WITH_NOP_BEFORE_DECL 0
#endif
#ifndef BREAK_DEPENDENCY
#define BREAK_DEPENDENCY 0
#endif
void BENCH_FUNC_ATTR
bench() {
    uint64_t       start, end;
    const uint64_t N        = (1UL << 24);
    const uint64_t WARMUP_N = N << 3;
    uint64_t       v0, v1, dst, loop_cnt;


    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "movl $1, %k[v1]\n\t"
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
        "xorl %k[loop_cnt], %k[v0]\n\t"
        "xorl %k[loop_cnt], %k[v1]\n\t"
        "movl %k[v0], %k[dst]\n\t"
        ".p2align 4\n\t"
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc", "memory");


    asm volatile("lfence\n\t" : : : "memory");
    start = _rdtsc();
    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "movl $1, %k[v1]\n\t"
        "movl %[N], %k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt], %k[v0]\n\t"
        "xorl %k[loop_cnt], %k[v1]\n\t"
        "movl %k[v0], %k[dst]\n\t"
#else
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#endif
        ".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
        "nop\n\t"
#endif
#if BREAK_DEPENDENCY
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#endif
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc", "memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf(
        "SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
        "BREAK_DEPENDENCY - [%s]\n\t",
        SIX_BYTES_COMPUTATION ? "ON" : "OFF",
        WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");

    printf("%.3lf \"Cycles\"\n", dif);
}


int
main(int argc, char ** argv) {
    bench();
}
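In case it matters, each configuration is built by passing the three macros on the compile line, roughly like this (the source file name is just whatever the code above is saved as):

import itertools
import os

# Hypothetical file / binary names.
SRC = "bench.cc"
BIN = "./bench"

build = ("g++ -O3 -std=c++17 -march=native -mtune=native "
         "-DSIX_BYTES_COMPUTATION={} -DWITH_NOP_BEFORE_DECL={} "
         "-DBREAK_DEPENDENCY={} " + SRC + " -o " + BIN)

# Run every on/off combination of the three knobs.
for six, nop, dep in itertools.product([0, 1], repeat=3):
    os.system(build.format(six, nop, dep))
    os.system(BIN)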

New Times

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.674 "Cycles"
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.799 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.747 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.650 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.727 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    0.645 "Cycles"

The new times show the same trend as before; they are just all a lot faster.

Edit: Icelake perf numbers

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.681 "Cycles"
    18,522,385,353      lsd.uops
         1,038,665      idq.dsb_uops
     4,270,402,172      cpu-cycles


SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.778 "Cycles"
    20,669,567,680      lsd.uops
         1,049,193      idq.dsb_uops
     4,807,261,565      cpu-cycles


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.734 "Cycles"
    12,080,048,840      lsd.uops
         1,035,128      idq.dsb_uops
     4,552,666,461      cpu-cycles


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.659 "Cycles"
    14,232,154,418      lsd.uops
         1,150,777      idq.dsb_uops
     4,134,080,501      cpu-cycles


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.735 "Cycles"
    12,080,166,963      lsd.uops
           982,311      idq.dsb_uops
     4,553,457,015      cpu-cycles


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    0.644 "Cycles"
    16,374,872,770      lsd.uops
         1,022,379      idq.dsb_uops
     4,055,306,960      cpu-cycles

Edit: I am certain it's not dependency-chain or code-size related. Adding one nop (a uop that never executes in the backend) in certain places really helps performance. Here's a benchmark that I think demonstrates that clearly.

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>


#ifndef UOP
#define UOP 0
#endif
#ifndef BYTE
#define BYTE 0
#endif
#ifndef NOP
#define NOP 0
#endif
#ifndef BREAK_DEP
#define BREAK_DEP 0
#endif
#ifndef COMPUTE_UOP
#define COMPUTE_UOP 0
#endif

#if BREAK_DEP && NOP
#error "Either define NOP or BREAK_DEP"
#endif

#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
void BENCH_FUNC_ATTR
bench() {
    uint64_t          start, end;
    const uint64_t    N        = (1UL << 31);
    const uint64_t    WARMUP_N = N >> 3;
    register uint64_t v0 asm("rdi");
    register uint64_t v1 asm("rsi");
    register uint64_t v2 asm("rdx");
#if COMPUTE_UOP
    register uint64_t v3 asm("rax");
#endif
    register uint64_t loop_cnt asm("rcx");


    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
        "xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_UOP
        "xorl %k[v3], %k[v3]\n\t"
#endif
        "movl %[N], %k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"

#if UOP == 1 && BYTE == 1 && NOP == 1
        "nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
        "nopl   0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
        "incl %k[v3]\n\t"
#endif

        "incl %k[v0]\n\t"
        "incl %k[v1]\n\t"
        "incl %k[v2]\n\t"

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
          [ v3 ] "=&r"(v3),
#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc", "memory");


    start = _rdtsc();
    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
        "xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_UOP
        "xorl %k[v3], %k[v3]\n\t"
#endif
        "movl %[N], %k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"

#if UOP == 1 && BYTE == 1 && NOP == 1
        "nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
        "nopl   0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
        "incl %k[v3]\n\t"
#endif

        "incl %k[v0]\n\t"
        "incl %k[v1]\n\t"
        "incl %k[v2]\n\t"

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
          [ v3 ] "=&r"(v3),
#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc", "memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf("UOP         -> %d\n", UOP);
    printf("BYTE        -> %d\n", BYTE);
    printf("NOP         -> %d\n", NOP);
    printf("BREAK_DEP   -> %d\n", BREAK_DEP);
    printf("COMPUTE_UOP -> %d\n", COMPUTE_UOP);
    printf("%.3lf \"Cycles\"\n", dif);
}


int
main(int argc, char ** argv) {
    bench();
}

Results: You can see that basically, if there is 1 extra uop that WON'T be executed in the backend, performance is ~.39 ref-cycles per iteration for the 5-uop loop (5 uops being the ICL front-end width). Otherwise, with no NOP or xor-zeroing filler, it's ~.54 ref-cycles per iteration for the 4-uop loop (the 0.554 / 0.391 ≈ 1.42 ratio is independent of the unknown core-to-TSC frequency ratio, assuming the core clock stays roughly constant across runs, so the 4-uop loop really is taking ~40% more core cycles per iteration):

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801      idq_uops_not_delivered.cycles_fe_was_ok
    5,840,894      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,419,599,257      idq_uops_not_delivered.cycles_fe_was_ok
    4,791,034      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,411,711      idq_uops_not_delivered.cycles_fe_was_ok
    5,915,776      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.554 "Cycles"
3,032,038,334      idq_uops_not_delivered.cycles_fe_was_ok
  215,022,743      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.555 "Cycles"
3,032,032,735      idq_uops_not_delivered.cycles_fe_was_ok
  214,953,593      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 4
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.683 "Cycles"
3,629,685,924      idq_uops_not_delivered.cycles_fe_was_ok
    7,883,534      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.395 "Cycles"
2,440,807,570      idq_uops_not_delivered.cycles_fe_was_ok
   26,095,530      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 4
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.520 "Cycles"
2,821,992,876      idq_uops_not_delivered.cycles_fe_was_ok
    4,762,782      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 1
0.624 "Cycles"
3,864,366,562      idq_uops_not_delivered.cycles_fe_was_ok
1,450,508,248      uops_issued.stall_cycles
--------------------------------------------------------------

UOP         -> 0
BYTE        -> 0
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.539 "Cycles"
2,947,391,859      idq_uops_not_delivered.cycles_fe_was_ok
1,341,303,591      uops_issued.stall_cycles

Run Script (Fixed):

import os
import sys

fname = "test-nop"
if (len(sys.argv) > 1):
    fname = sys.argv[1]

build_cmd = "g++ -DUOP={} -DBYTE={} -DNOP={} -DBREAK_DEP={} -DCOMPUTE_UOP={} -O3 -std=c++17 -march=native -mtune=native " + fname + ".cc -o " + fname
run_cmd = "perf stat -e idq_uops_not_delivered.cycles_fe_was_ok -e uops_issued.stall_cycles ./{}"

zero_one = [0, 1]

uop = [1, 2, 4]
byte = [1, 2, 4]
nop = [1]
break_dep = [1]
compute_uop = [1]

for n in nop:
    for u in uop:
        for b in byte:
            # can't encode u nops (each at least 1 byte) in fewer than u bytes
            if b < u:
                continue

            os.system(build_cmd.format(u, b, n, 0, 0))
            os.system(run_cmd.format(fname))

for bd in break_dep:
    for u in uop:
        for b in byte:
            # each xor-zeroing idiom used here is a 2-byte instruction
            if b != 2 * u:
                continue

            if b < u:
                continue
            os.system(build_cmd.format(u, b, 0, bd, 0))
            os.system(run_cmd.format(fname))

os.system(build_cmd.format(1, 2, 0, 0, 1))
os.system(run_cmd.format(fname))

os.system(build_cmd.format(0, 0, 0, 0, 0))
os.system(run_cmd.format(fname))

Edit: Something interesting. It appears that for a nop to make the 5-uop loop perform better than the 4-uop loop, the placement is important; a zero-idiom xorl, however, always improves performance. Here are the numbers for the 4 placements, with the nop / xorl interleaved at different points in the loop. The nop version only shows the improvement when it is the first instruction, whereas the xorl version always shows it. This is a bit strange given that in the first results a nop right before the decl + jnz was helping. The only thing I can think of is that the location affects where things are placed in the uop cache or the LSD buffer?
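To check whether the placement changes which path feeds the loop, the plan is to compare LSD vs. DSB vs. legacy-decode delivery with something along these lines (binary name is a placeholder; idq.mite_uops should be the legacy-decode counter, adjust if perf list names it differently):

import os

# Hypothetical binary name. lsd.uops and idq.dsb_uops are the same events shown
# above; idq.mite_uops counts uops delivered by the legacy decode pipeline.
BIN = "./bench"
os.system("perf stat -e lsd.uops -e idq.dsb_uops -e idq.mite_uops " + BIN)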

Numbers:

################################################################
    <nop, xorl, etc...>
    incl
    incl
    incl
    decl
    jnz



UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,418,941,957      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,418,490,126      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.390 "Cycles"
2,419,125,302      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,033,520,044      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.394 "Cycles"
2,442,515,834      idq_uops_not_delivered.cycles_fe_was_ok


################################################################
    incl
    <nop, xorl, etc...>
    incl
    incl
    decl
    jnz


UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.566 "Cycles"
3,390,955,219      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,373,556,409      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.565 "Cycles"
3,380,145,525      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.391 "Cycles"
2,428,978,799      idq_uops_not_delivered.cycles_fe_was_ok


################################################################
    incl
    incl
    <nop, xorl, etc...>
    incl
    decl
    jnz



UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,377,709,071      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,377,494,813      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,377,019,951      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.389 "Cycles"
2,420,319,618      idq_uops_not_delivered.cycles_fe_was_ok


################################################################
    incl
    incl
    incl
    <nop, xorl, etc...>
    decl
    jnz



UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.556 "Cycles"
3,329,607,623      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.559 "Cycles"
3,340,246,297      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,329,254,092      idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.543 "Cycles"
3,279,214,443      idq_uops_not_delivered.cycles_fe_was_ok

Edit: Data for a trial with 4 independent incl instructions in the loop, making it a 6-uop loop with the extra instruction or a 5-uop loop without. I was able to see a measurable and reproducible (though more modest) performance improvement when adding a 6th uop in the following cases: if the 6th uop is a nop (1, 2, or 4 bytes) it must be between the 1st and 2nd incl; if the 6th uop is a zero-idiom xor it can be anywhere. Here are the results when the 6th instruction is between the 1st and 2nd incl:

Loop looks like:

incl
<6th instruction>
incl
incl
incl
decl
jnz

Times:

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.603 "Cycles"
3,242,400,541      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.604 "Cycles"
3,244,473,075      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.601 "Cycles"
3,239,305,874      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.641 "Cycles"
3,330,420,250      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 2
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.649 "Cycles"
3,334,340,019      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 4
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.788 "Cycles"
3,989,749,825      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 1
BYTE        -> 2
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.551 "Cycles"
2,893,829,059      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 2
BYTE        -> 4
NOP         -> 0
BREAK_DEP   -> 1
COMPUTE_UOP -> 0
0.604 "Cycles"
3,007,481,786      idq_uops_not_delivered.cycles_fe_was_ok

UOP         -> 0
BYTE        -> 0
NOP         -> 0
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.620 "Cycles"
3,755,030,033      idq_uops_not_delivered.cycles_fe_was_ok

Upvotes: 13

Views: 499

Answers (0)
