Reputation: 15263
Being executed on modern processor (AMD Phenom II 1090T), how many clock ticks does the following code consume more likely : 3 or 11?
label: mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The problem is, when I execute many iterations of such code, results vary near 3 OR 11 ticks per iteration from time to time. And I can't decide "who is who".
UPD According to Table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on AMD K10 microarchitecture. Therefore, impossible 3 ticks per iteration are caused by bugs in measurement.
SOLVED
@Atom noticed, that cycle frequency isn't constant in modern processors. When I disabled in BIOS three options - Core Performance Boost
, AMD C1E Support
and AMD K8 Cool&Quiet Control
, consumption of my "six instructions" stabilized on 3 clock ticks :-)
Upvotes: 6
Views: 442
Reputation: 471209
I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.
(Note that this is for processors in general and I make no references specific to AMD processors.)
Key Concepts:
Most modern (non-embedded) processors today are both super-scalar and out-of-order. Not only can execute multiple (independent) instructions in parallel, but they can re-order instructions to break dependencies and such.
Let's break down your example:
label:
mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The first thing to notice is that the last 3 instructions before the branch are all independent:
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
So it's possible for a processor to execute all 3 of these in parallel.
Another thing is this:
adc %rax, (%rdx)
lea 8(%rdx), %rdx
There seems to be a dependency on rdx
that prevents the two from running in parallel. But in reality, this is false dependency because the second instruction doesn't actually
depend on the output of the first instruction. Modern processors are able to rename the rdx
register to allow these two instructions to be re-ordered or done in parallel.
Same applies to the rsi
register between:
mov (%rsi), %rax
lea 8(%rsi), %rsi
So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):
1: mov (%rsi), %rax lea 8(%rdx), %rdx lea 8(%rsi), %rsi
2: adc %rax, (%rdx) dec %ecx
3: jnz label
*Of course, I'm over-simplifying things for simplicity. In reality the latencies are probably longer and there's overlap between different iterations of the loop.
In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...
Upvotes: 8
Reputation: 41374
At Intel, Dr. David Levinthal's "Performance Analysis Guide" investigates the answers to such questions in great detail.
Upvotes: 2