Reputation: 97
I'm currently doing an assignment that measures the performance of various x86-64 instructions (AT&T syntax).
The instruction I'm somewhat confused about is the unconditional jmp. This is how I've implemented it:
.global uncond
uncond:
.rept 10000             # emit the following instruction 10000 times
    jmp . + 2           # jump to the next instruction (this jmp is 2 bytes)
.endr
    mov $10000, %rax    # return value: number of jmps executed
    ret
It's fairly simple. The code creates a function called "uncond" which uses the .rept directive to emit the jmp instruction 10000 times, then sets the return value to the number of jmp instructions executed.
In GAS, "." means the current address, which I increase by 2 bytes to account for the jmp instruction itself (the short form of jmp is 2 bytes long, so jmp . + 2 should simply jump to the next instruction).
Code that I haven't shown calculates the number of cycles it takes to process the 10000 instructions.
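The harness is an rdtsc-based measurement roughly like the following sketch (not my exact code; the file name and the exit-status output trick are just for illustration):

# time-uncond.S -- sketch of an rdtsc timing harness (illustrative)
.globl _start
_start:
    lfence                 # keep rdtsc from reordering with earlier insns
    rdtsc                  # timestamp into EDX:EAX
    shl  $32, %rdx
    or   %rax, %rdx
    mov  %rdx, %r12        # r12 = start timestamp
    call uncond            # the function under test
    lfence
    rdtsc
    shl  $32, %rdx
    or   %rax, %rdx
    sub  %r12, %rdx        # rdx = elapsed TSC ticks
    mov  %rdx, %rax
    xor  %edx, %edx
    mov  $10000, %ecx
    div  %rcx              # rax = ticks per jmp (approximate)
    mov  %eax, %edi        # crude output: exit status = ticks/jmp (mod 256)
    mov  $231, %eax
    syscall                # exit_group

Build with gcc -static -nostartfiles uncond.S time-uncond.S and read the result with ./a.out; echo $?. (Caveat: rdtsc counts reference cycles, which can differ from core clock cycles on CPUs with variable frequency.)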
My results say jmp is pretty slow (about 10 cycles to process a single jmp instruction), but from what I understand about pipelining, unconditional jumps should be very fast (no branch mispredictions).
Am I missing something? Is my code wrong?
Upvotes: 1
Views: 1673
Reputation: 365457
The CPU isn't optimized for no-op jmp instructions, so it doesn't handle the special case of continuing to decode and pipeline jmp instructions that just jump to the next insn.
CPUs are optimized for loops, though. jmp . will run at one insn per clock on many CPUs, or one per 2 clocks on some CPUs.
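As a quick sanity check that taken branches per se aren't the bottleneck, a tight dec/jnz loop (a measurable stand-in for jmp ., since jmp . alone can't terminate) sustains roughly one taken branch per clock on many x86 CPUs:

# loop-test.S -- sketch: one taken branch per iteration
.globl _start
_start:
    mov  $1000000000, %ecx
loop_top:
    dec  %ecx
    jnz  loop_top          # ~1 iteration per clock on many CPUs
    mov  $231, %eax
    xor  %edi, %edi
    syscall                # exit_group(0)

perf stat on that should show cycles roughly equal to branches. Your .rept block, by contrast, is one long straight-line stream of jumps, which is much harder on instruction fetch.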
A jump creates a bubble in instruction fetching. A single well-predicted jump is fine, but running nothing but jumps is problematic. I reproduced your results on a Core 2 E6600 (Merom/Conroe microarchitecture):
# jmp-test.S
.globl _start
_start:
    mov $100000, %ecx   # outer loop: 100k * 10k = 1e9 jmps total
jmp_test:
.rept 10000
    jmp . + 2
.endr
    dec %ecx
    jg jmp_test

    mov $231, %eax      # __NR_exit_group
    xor %edi, %edi      # status = 0 (first syscall arg goes in %rdi)
    syscall             # exit_group(0)
Build and run with:
gcc -static -nostartfiles jmp-test.S
perf stat -e task-clock,cycles,instructions,branches,branch-misses ./a.out
Performance counter stats for './a.out':
3318.616490 task-clock (msec) # 0.997 CPUs utilized
7,940,389,811 cycles # 2.393 GHz (49.94%)
1,012,387,163 instructions # 0.13 insns per cycle (74.95%)
1,001,156,075 branches # 301.679 M/sec (75.06%)
151,609 branch-misses # 0.02% of all branches (75.08%)
3.329916991 seconds time elapsed
From another run:
7,886,461,952 L1-icache-loads # 2377.687 M/sec (74.95%)
7,715,854 L1-icache-load-misses # 2.326 M/sec (50.08%)
1,012,038,376 iTLB-loads # 305.119 M/sec (75.06%)
240 iTLB-load-misses # 0.00% of all iTLB cache hits (75.02%)
(The percentages at the end of each line show how much of the total run time each counter was active for: perf has to multiplex when you ask it to count more events than the hardware can count at once.)
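If you want exact counts instead of scaled estimates, ask for fewer events per run, e.g.:

perf stat -e task-clock,cycles,instructions ./a.out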
So it's not actually I-cache misses, it's just instruction fetch/decode frontend bottlenecks caused by constant jumps.
My SnB machine is broken, so I can't test numbers on it, but ~8 cycles per jmp sustained throughput (7,940,389,811 cycles / 1,001,156,075 branches ≈ 7.9) is pretty close to your results (which were probably from a different microarchitecture).
For more details, see http://agner.org/optimize/, and other links from the x86 tag wiki.
Upvotes: 1