Reputation: 5063
I have an Ivy Bridge CPU. The following code takes 3 cycles per iteration:
L1:
movapd xmm1, [rsi+rax] ; X[i], X[i+1]
mulpd xmm1, xmm2
movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
subpd xmm0, xmm1
movapd [rdi+rax], xmm0 ; Store result
add rax, 16
cmp rax, rcx
jl L1
The following takes 9 cycles per iteration:
L1:
movapd xmm1, [rsi+rax] ; X[i], X[i+1]
mulpd xmm1, xmm2
movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
add rax, 16
subpd xmm0, xmm1
movapd [rdi+rax], xmm0 ; Store result
cmp rax, rcx
jl L1
The only difference is the position of add rax, 16 (before the subpd instead of after the store). That alone makes the program 3 times slower. Why is the difference so substantial?
Upvotes: 3
Views: 73
Reputation: 126175
The main reason is that it stores the result in a different location, which also happens to be the location read by the next iteration of the loop.
Doing that interferes with the CPU's out-of-order execution -- the next iteration of the loop can't start until the current iteration completes, due to the data dependency.
I would imagine if you change the store instruction to store back to the same location, the second loop would become substantially faster again:
movapd [rdi+rax-16], xmm0 ; Store result
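For reference, here is a sketch of the second loop with that one change applied. It assumes the same register setup as in the question (rsi = X, rdi = Y, xmm2 = the multiplier, rcx = the end offset) and has not been timed; it only shows that the -16 displacement undoes the early add, so the store lands on the element that was just loaded:

```asm
L1:
movapd xmm1, [rsi+rax]      ; X[i], X[i+1]
mulpd  xmm1, xmm2
movapd xmm0, [rdi+rax]      ; Y[i], Y[i+1]
add    rax, 16              ; rax now points at the next pair
subpd  xmm0, xmm1
movapd [rdi+rax-16], xmm0   ; store back to Y[i], Y[i+1], not the next pair
cmp    rax, rcx
jl     L1
```

With this store address, no iteration reads memory written by the previous one, so the iterations are independent again, just as in the first loop.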
Upvotes: 5
Reputation: 64904
They have a substantially different dependency structure.
With add rax, 16 moved up, the result is now written back to the next item. So the next iteration reads what was just written there, and so on.
So now all of a sudden there is a loop-carried dependency chain of movapd xmm0, [..] -> subpd -> movapd [next], xmm0, which has a latency of 9 cycles on Ivy Bridge (3 per instruction, as it happens).
Before, the only loop-carried dependency was the trivial one on rax.
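To make the chain visible, here is a hand-unrolled trace of two consecutive iterations of the slow loop. It assumes rax = 0 on entry to the first iteration (an illustrative choice) so that the addresses can be written out explicitly, and it omits the add/cmp/jl bookkeeping:

```asm
; iteration 0 (rax = 0 on entry)
movapd xmm1, [rsi]          ; X[0], X[1]
mulpd  xmm1, xmm2
movapd xmm0, [rdi]          ; load Y[0], Y[1]
subpd  xmm0, xmm1
movapd [rdi+16], xmm0       ; store lands on Y[2], Y[3]

; iteration 1 (rax = 16 on entry)
movapd xmm1, [rsi+16]       ; X[2], X[3]: independent, can run early
mulpd  xmm1, xmm2
movapd xmm0, [rdi+16]       ; load Y[2], Y[3]: needs the store from iteration 0
subpd  xmm0, xmm1
movapd [rdi+32], xmm0       ; store lands on Y[4], Y[5], read by iteration 2
```

Only the X load and the mulpd are free to run ahead; the load, subpd and store of Y form a chain through memory that each iteration has to wait on.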
Upvotes: 4