Reputation: 1759
All benchmarks are done on: Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark)
Edit: I was not able to reproduce this on broadwell and @PeterCordes was unable to reproduce it on skylake
I was trying to benchmark different methods of doing integer min(a, b)
but ran into some unexplained behavior that I've boiled down to the following benchmark:
#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
#define SIX_BYTES_COMPUTATION 1
#define WITH_NOP_BEFORE_DECL 0
#define BREAK_DEPENDENCY 0
void BENCH_FUNC_ATTR
bench() {
uint64_t start, end;
const uint64_t N = 1000000;
start = _rdtsc();
uint64_t v0, v1, dst, loop_cnt;
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"movl $1, %k[v1]\n\t"
"movl %[N], %k[loop_cnt]\n\t"
".p2align 6\n\t"
"1:\n\t"
#if SIX_BYTES_COMPUTATION
"xorl %k[loop_cnt], %k[v0]\n\t"
"xorl %k[loop_cnt], %k[v1]\n\t"
"movl %k[v0], %k[dst]\n\t"
#else
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#endif
".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
"nop\n\t"
#endif
#if BREAK_DEPENDENCY
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#endif
// macro-fusion is NOT broken
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc", "memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf(
"SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
"BREAK_DEPENDENCY - [%s]\n\t",
SIX_BYTES_COMPUTATION ? "ON" : "OFF",
WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");
printf("%.3lf \"Cycles\"\n", dif);
}
Turning on WITH_NOP_BEFORE_DECL
so that there is a nop
before the decl
+ jnz
causes a measurable performance improvement when SIX_BYTES_COMPUTATION
is turned on though causes measurable performance degradation when SIX_BYTES_COMPUTATION
is turned off.
Here are the numbers:
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
2.080 "Cycles" <--- Just 6 nops
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
2.363 "Cycles" <--- Performance degradation from previous
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
2.185 "Cycles" <--- Computation then decl + jnz
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
1.945 "Cycles" <--- Performance improvement from previous
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
1.919 "Cycles" <--- Breaking dependencies has best performance
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
2.046 "Cycles" <--- nop hurts performance when breaking dependencies
It might have to do with the register file filling up? I was able to find one potentially interesting metric uops_issued.stall_cycles [Cycles when RAT does not issue Uops to RS for the thread]
which has the following outputs:
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
473,647 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
495,380 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
1,406,244 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
875,364 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
647,297 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
501,015 uops_issued.stall_cycles
It does seem to correspond with SIX_BYTES_COMPUTATION
on and WITH_NOP_BEFORE_DECL
on or off but I'm not sure 1) why a nop
would save room in the register files.
I'm pretty sure its not an alignment issue because with the .p2align 4
between the first 6 bytes of the loop body and the decl
+ jnz
the decl
+ jnz
will be on a different 16 byte aligned region either way and it the performance difference is reversed depending on what is in the loop body (so if it where an alignment thing it shouldn't matter if loop body is nops or computation).
I was thinking it might to do with some dependency chain issue but because if I break the dependency on v0
and v1
at the end of the loop then WITH_NOP_BEFORE_DECL
turned on causes a performance degradation. I am very likely wrong about that though because I have no idea why a nop
before the end of the loop would affect any dependency issues.
It almost definitely does not have to do with port scheduling. I was thinking maybe something weird was going where the nop
by chance lead to better scheduling but didn't get any different in uops on ports 1,2,5,6 with out or WITH_NOP_BEFORE_DECL
on:
Instruction per port with SIX_BYTES_COMPUTATION
on and WITH_NOP_BEFORE_DECL
Off:
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF]
1,147,196 uops_dispatched.port_0
1,114,665 uops_dispatched.port_1
1,138,238 uops_dispatched.port_5
1,266,212 uops_dispatched.port_6
Instruction per port with SIX_BYTES_COMPUTATION
on and WITH_NOP_BEFORE_DECL
On:
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON]
1,177,092 uops_dispatched.port_0
1,081,734 uops_dispatched.port_1
1,103,314 uops_dispatched.port_5
1,296,546 uops_dispatched.port_6
My leading theory is that there is some inefficiency in the register renaming process that is the performance bound without the nop
and by luck the nop
hides that issue but I am not at all confident in that.
Can anyone help me understand this behavior.
Edit: Full cpp code and new times w/ warmup and lfence
before rdtsc
.
New Code
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>
#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
#ifndef SIX_BYTES_COMPUTATION
#define SIX_BYTES_COMPUTATION 0
#endif
#ifndef WITH_NOP_BEFORE_DECL
#define WITH_NOP_BEFORE_DECL 0
#endif
#ifndef BREAK_DEPENDENCY
#define BREAK_DEPENDENCY 0
#endif
void BENCH_FUNC_ATTR
bench() {
uint64_t start, end;
const uint64_t N = (1UL << 24);
const uint64_t WARMUP_N = N << 3;
uint64_t v0, v1, dst, loop_cnt;
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"movl $1, %k[v1]\n\t"
"movl %[N], %k[loop_cnt]\n\t"
".p2align 6\n\t"
"1:\n\t"
"xorl %k[loop_cnt], %k[v0]\n\t"
"xorl %k[loop_cnt], %k[v1]\n\t"
"movl %k[v0], %k[dst]\n\t"
".p2align 4\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(WARMUP_N)
: "cc", "memory");
asm volatile("lfence\n\t" : : : "memory");
start = _rdtsc();
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"movl $1, %k[v1]\n\t"
"movl %[N], %k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if SIX_BYTES_COMPUTATION
"xorl %k[loop_cnt], %k[v0]\n\t"
"xorl %k[loop_cnt], %k[v1]\n\t"
"movl %k[v0], %k[dst]\n\t"
#else
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#endif
".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
"nop\n\t"
#endif
#if BREAK_DEPENDENCY
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#endif
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc", "memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf(
"SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
"BREAK_DEPENDENCY - [%s]\n\t",
SIX_BYTES_COMPUTATION ? "ON" : "OFF",
WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");
printf("%.3lf \"Cycles\"\n", dif);
}
int
main(int argc, char ** argv) {
bench();
}
New Times
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.674 "Cycles"
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.799 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.747 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.650 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.727 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
0.645 "Cycles"
New times have the same trend as before just they are all a lot faster.
Edit: Icelake perf numbers
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.681 "Cycles"
18,522,385,353 lsd.uops
1,038,665 idq.dsb_uops
4,270,402,172 cpu-cycles
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.778 "Cycles"
20,669,567,680 lsd.uops
1,049,193 idq.dsb_uops
4,807,261,565 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.734 "Cycles"
12,080,048,840 lsd.uops
1,035,128 idq.dsb_uops
4,552,666,461 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.659 "Cycles"
14,232,154,418 lsd.uops
1,150,777 idq.dsb_uops
4,134,080,501 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.735 "Cycles"
12,080,166,963 lsd.uops
982,311 idq.dsb_uops
4,553,457,015 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
0.644 "Cycles"
16,374,872,770 lsd.uops
1,022,379 idq.dsb_uops
4,055,306,960 cpu-cycles
Edit: I am certain its not dependency chain or byte related. Its adding one nop (non-backend uop) in certain places really helped performance. Hers a benchmark that demonstrates that really clearly I think.
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>
#ifndef UOP
#define UOP 0
#endif
#ifndef BYTE
#define BYTE 0
#endif
#ifndef NOP
#define NOP 0
#endif
#ifndef BREAK_DEP
#define BREAK_DEP 0
#endif
#ifndef COMPUTE_UOP
#define COMPUTE_UOP 0
#endif
#if BREAK_DEP && NOP
#error "Either define NOP or BREAK_DEP"
#endif
#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
void BENCH_FUNC_ATTR
bench() {
uint64_t start, end;
const uint64_t N = (1UL << 31);
const uint64_t WARMUP_N = N >> 3;
register uint64_t v0 asm("rdi");
register uint64_t v1 asm("rsi");
register uint64_t v2 asm("rdx");
#if COMPUTE_UOP
register uint64_t v3 asm("rax");
#endif
register uint64_t loop_cnt asm("rcx");
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
"xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_OUP
"xorl %k[v3], %k[v3]\n\t"
#endif
"movl %[N], %k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if UOP == 1 && BYTE == 1 && NOP == 1
"nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
"xchg %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
"nopl 0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
"xchg %%ax,%%ax\n\t"
"xchg %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
"incl %k[v3]\n\t"
#endif
"incl %k[v0]\n\t"
"incl %k[v1]\n\t"
"incl %k[v2]\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
[ v3 ] "=&r"(v3),
#endif
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(WARMUP_N)
: "cc", "memory");
start = _rdtsc();
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
"xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_OUP
"xorl %k[v3], %k[v3]\n\t"
#endif
"movl %[N], %k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if UOP == 1 && BYTE == 1 && NOP == 1
"nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
"xchg %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
"nopl 0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
"xchg %%ax,%%ax\n\t"
"xchg %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
"incl %k[v3]\n\t"
#endif
"incl %k[v0]\n\t"
"incl %k[v1]\n\t"
"incl %k[v2]\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
[ v3 ] "=&r"(v3),
#endif
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc", "memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf("UOP -> %d\n", UOP);
printf("BYTE -> %d\n", BYTE);
printf("NOP -> %d\n", NOP);
printf("BREAK_DEP -> %d\n", BREAK_DEP);
printf("COMPUTE_UOP -> %d\n", COMPUTE_UOP);
printf("%.3lf \"Cycles\"\n", dif);
}
int
main(int argc, char ** argv) {
bench();
}
Results: You can see basically if its 1 uop that WONT be executed in the backend performance is ~.39 ref-cycles an iteration for a 5-uop loop (ICL front-end width). Otherwise with no NOP or xor-zeroing filler its ~.54 ref-cycles an iteration of a 4-uop loop:
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801 idq_uops_not_delivered.cycles_fe_was_ok
5,840,894 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,419,599,257 idq_uops_not_delivered.cycles_fe_was_ok
4,791,034 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,411,711 idq_uops_not_delivered.cycles_fe_was_ok
5,915,776 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.554 "Cycles"
3,032,038,334 idq_uops_not_delivered.cycles_fe_was_ok
215,022,743 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.555 "Cycles"
3,032,032,735 idq_uops_not_delivered.cycles_fe_was_ok
214,953,593 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 4
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.683 "Cycles"
3,629,685,924 idq_uops_not_delivered.cycles_fe_was_ok
7,883,534 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.395 "Cycles"
2,440,807,570 idq_uops_not_delivered.cycles_fe_was_ok
26,095,530 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 4
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.520 "Cycles"
2,821,992,876 idq_uops_not_delivered.cycles_fe_was_ok
4,762,782 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 0
COMPUTE_UOP -> 1
0.624 "Cycles"
3,864,366,562 idq_uops_not_delivered.cycles_fe_was_ok
1,450,508,248 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 0
BYTE -> 0
NOP -> 0
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.539 "Cycles"
2,947,391,859 idq_uops_not_delivered.cycles_fe_was_ok
1,341,303,591 uops_issued.stall_cycles
Run Script (Fixed):
import os
import sys
fname = "test-nop"
if (len(sys.argv) > 1):
fname = sys.argv[1]
build_cmd = "g++ -DUOP={} -DBYTE={} -DNOP={} -DBREAK_DEP={} -DCOMPUTE_UOP={} -O3 -std=c++17 -march=native -mtune=native " + fname + ".cc -o " + fname
run_cmd = "perf stat -e idq_uops_not_delivered.cycles_fe_was_ok -e uops_issued.stall_cycles ./{}"
zero_one = [0, 1]
uop = [1, 2, 4]
byte = [1, 2, 4]
nop = [1]
break_dep = [1]
compute_uop = [1]
for n in nop:
for u in uop:
for b in byte:
if b < u:
continue
os.system(build_cmd.format(u, b, n, 0, 0))
os.system(run_cmd.format(fname))
for bd in break_dep:
for u in uop:
for b in byte:
if b != 2 * u:
continue
if b < u:
continue
os.system(build_cmd.format(u, b, 0, bd, 0))
os.system(run_cmd.format(fname))
os.system(build_cmd.format(1, 2, 0, 0, 1))
os.system(run_cmd.format(fname))
os.system(build_cmd.format(0, 0, 0, 0, 0))
os.system(run_cmd.format(fname))
Edit: Something interesting. It appears that for a nop
to make a 5-uop loop perform better than a 4-uop loop the placement is important. Zero-idiom xorl
, however always improves performance. Here are the numbers of the 4 cases where we where seeing the 5-uop loop perform with the nop
/ xorl
interleaved at different points. The nop
version only has an improvement when it is the first instruction whereas the xorl
version always has the performance improvement. This is a bit strange given the first results where a nop
at was helping. The only thing I can think of is that location affects where things are being placed in the uop-cache or LSD buffer maybe?
Numbers:
################################################################
<nop, xorl, etc...>
incl
incl
incl
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,418,941,957 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,418,490,126 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.390 "Cycles"
2,419,125,302 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 2
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,033,520,044 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.394 "Cycles"
2,442,515,834 idq_uops_not_delivered.cycles_fe_was_ok
################################################################
incl
<nop, xorl, etc...>
incl
incl
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.566 "Cycles"
3,390,955,219 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,373,556,409 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.565 "Cycles"
3,380,145,525 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.391 "Cycles"
2,428,978,799 idq_uops_not_delivered.cycles_fe_was_ok
################################################################
incl
incl
<nop, xorl, etc...>
incl
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,377,709,071 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.563 "Cycles"
3,377,494,813 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.564 "Cycles"
3,377,019,951 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.389 "Cycles"
2,420,319,618 idq_uops_not_delivered.cycles_fe_was_ok
################################################################
incl
incl
incl
<nop, xorl, etc...>
decl
jnz
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.556 "Cycles"
3,329,607,623 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.559 "Cycles"
3,340,246,297 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.553 "Cycles"
3,329,254,092 idq_uops_not_delivered.cycles_fe_was_ok
----------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.543 "Cycles"
3,279,214,443 idq_uops_not_delivered.cycles_fe_was_ok
Edit: Data for trial with 4 independent incl
instructions in the loop. Making this a 6 uop loop with a nop
or a 5uop loop without. I was able to see a measurable and reproducible performance improvement (more modest) when adding a 6th uop in the following cases: If the 6th uop was a nop
(1, 2, or 4 bytes) it must be between the 1st and 2nd incl
. If the 6th uop was a zero-idiom xor
it can be anywhere. Here are the results when the 6th instruction is between the 1st and 2nd incl
:
Loop looks like:
incl
<6th instruction>
incl
incl
incl
decl
jnz
Times:
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.603 "Cycles"
3,242,400,541 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.604 "Cycles"
3,244,473,075 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.601 "Cycles"
3,239,305,874 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 2
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.641 "Cycles"
3,330,420,250 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 2
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.649 "Cycles"
3,334,340,019 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 4
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.788 "Cycles"
3,989,749,825 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 1
BYTE -> 2
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.551 "Cycles"
2,893,829,059 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 2
BYTE -> 4
NOP -> 0
BREAK_DEP -> 1
COMPUTE_UOP -> 0
0.604 "Cycles"
3,007,481,786 idq_uops_not_delivered.cycles_fe_was_ok
UOP -> 0
BYTE -> 0
NOP -> 0
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.620 "Cycles"
3,755,030,033 idq_uops_not_delivered.cycles_fe_was_ok
Upvotes: 13
Views: 499