Gilgamesz

Reputation: 5063

Two similar assembly loops, a substantial performance difference

I have an Ivy Bridge CPU. The following code takes 3 cycles per iteration:

L1:    
    movapd xmm1, [rsi+rax] ; X[i], X[i+1]
    mulpd xmm1, xmm2
    movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
    subpd xmm0, xmm1
    movapd [rdi+rax], xmm0 ; Store result
    add rax, 16
    cmp rax, rcx
    jl L1

The following takes 9 cycles per iteration:

L1:
    movapd xmm1, [rsi+rax] ; X[i], X[i+1]
    mulpd xmm1, xmm2
    movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
    add rax, 16
    subpd xmm0, xmm1
    movapd [rdi+rax], xmm0 ; Store result
    cmp rax, rcx
    jl L1

The only difference is the position of add rax, 16, yet it makes the loop 3 times slower. Why is the difference so substantial?

Upvotes: 3

Views: 73

Answers (2)

Chris Dodd

Reputation: 126175

The main reason is that it stores the result in a different location, which also happens to be the location read by the next iteration of the loop.

Doing that interferes with the CPU's out-of-order execution -- the next iteration's load of Y can't complete until the current iteration's store has produced its data, due to the data dependency.

I would imagine if you change the store instruction to store back to the same location, the second loop would become substantially faster again:

movapd [rdi+rax-16], xmm0 ; Store result
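
To make the effect of the store address concrete, here is a rough C model of the three variants. It is only a sketch: the function names are made up for illustration, each movapd is modelled as a pair of doubles, rax is replaced by an element index i (so add rax, 16 becomes a step of 2), n is assumed to be even and nonzero, and for the shifted store the Y buffer is assumed to have room for two extra elements.

```c
#include <assert.h>
#include <stddef.h>

/* First loop from the question: load Y[i..i+1], store to the same slots.
   Every iteration touches its own slots, so iterations are independent. */
void loop_same_slot(double *y, const double *x, double c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        double lo = y[i]     - x[i]     * c;   /* mulpd + subpd, low lane  */
        double hi = y[i + 1] - x[i + 1] * c;   /* mulpd + subpd, high lane */
        y[i]     = lo;
        y[i + 1] = hi;
    }
}

/* Second loop: the index is bumped before the store, so the result lands
   in Y[i+2..i+3], exactly what the next iteration loads.
   y must hold n + 2 elements (assumption of this sketch). */
void loop_next_slot(double *y, const double *x, double c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        double lo = y[i]     - x[i]     * c;
        double hi = y[i + 1] - x[i + 1] * c;
        size_t j = i + 2;                      /* add rax, 16 moved up */
        y[j]     = lo;
        y[j + 1] = hi;
    }
}

/* Second loop with the suggested -16 displacement on the store:
   same instruction order, but the store goes back to Y[i..i+1]. */
void loop_next_slot_fixed(double *y, const double *x, double c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        double lo = y[i]     - x[i]     * c;
        double hi = y[i + 1] - x[i + 1] * c;
        size_t j = i + 2;                      /* add rax, 16 moved up  */
        y[j - 2] = lo;                         /* [rdi+rax-16]          */
        y[j - 1] = hi;
    }
}
```

With the displacement in place the reordered loop computes the same array as the original, so the reordering by itself is harmless; it is the shifted store address that changes both the result and the dependency structure.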

Upvotes: 5

user555045

Reputation: 64904

They have a substantially different dependency structure.

With the moved add rax, 16, now the place the result is written back to is the next item. So in the next iteration, it reads what it just wrote there, and so on.

So now all of a sudden there is a loop-carried dependency chain of movapd xmm0, [..] -> subpd -> movapd [next], xmm0, which has a latency of 9 cycles on Ivy Bridge (3 per instruction, as it happens).

Before, the only loop-carried dependency was the trivial one on rax.
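
One way to see the chain is to model a single lane of the second loop in C. This is a simplified sketch, not the real code: the names are invented here, only one of the two packed doubles is tracked, and the stride is reduced to one element. The point is that the memory version behaves like a running value carried from one iteration to the next.

```c
#include <stddef.h>

/* One lane of the second loop: each iteration loads y[k] and stores to
   y[k + 1], which is the slot the next iteration loads.
   y must hold iters + 1 elements (assumption of this sketch). */
void store_to_next(double *y, const double *x, double c, size_t iters)
{
    for (size_t k = 0; k < iters; k++)
        y[k + 1] = y[k] - x[k] * c;
}

/* The same computation with the memory round trip written as a carried
   variable. x[k] * c does not depend on earlier iterations, but the
   subtraction cannot start until the previous result is available. */
double carried_chain(double y0, const double *x, double c, size_t iters)
{
    double carried = y0;                /* value passed from store to load */
    for (size_t k = 0; k < iters; k++)
        carried = carried - x[k] * c;   /* load -> subpd -> store, serial  */
    return carried;
}
```

Each step of carried_chain has to wait for the previous one, so the iterations cannot overlap and the loop runs at the latency of the chain (the 9 cycles quoted above) rather than at its throughput. In the first loop there is no such carried value, only the index update.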

Upvotes: 4
