Two similar pieces of assembly code. A substantial difference

Question

I have an Ivy Bridge CPU. The following code takes 3 cycles per iteration:

L1:    
    movapd xmm1, [rsi+rax] ; X[i], X[i+1]
    mulpd xmm1, xmm2
    movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
    subpd xmm0, xmm1
    movapd [rdi+rax], xmm0 ; Store result
    add rax, 16
    cmp rax, rcx
    jl L1

The following takes 9 cycles per iteration:

L1:
    movapd xmm1, [rsi+rax] ; X[i], X[i+1]
    mulpd xmm1, xmm2
    movapd xmm0, [rdi+rax] ; Y[i], Y[i+1]
    add rax, 16
    subpd xmm0, xmm1
    movapd [rdi+rax], xmm0 ; Store result
    cmp rax, rcx
    jl L1

The only difference is the position of add rax, 16, yet it makes the program 3 times slower. Why is the difference so substantial?

Chris Dodd · Accepted Answer

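The main reason is that the second loop stores the result in a different location. Because add rax, 16 now runs before the store, the result goes 16 bytes past the address that was loaded (Y[i+2], Y[i+3] instead of Y[i], Y[i+1]), which also happens to be the location read by the next iteration of the loop.

Doing that interferes with the CPU's out-of-order execution -- the next iteration's load from Y, and the subpd and store that depend on it, can't complete until the current iteration's store has produced its data. The iterations form one serial chain (load, subpd, store, load, ...) instead of overlapping, due to the data dependency.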

I would imagine if you change the store instruction to store back to the same location, the second loop would become substantially faster again:

movapd [rdi+rax-16], xmm0 ; Store result
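
To make the dependency easier to see, here is a rough C model of the three variants. This is only a sketch: the function names, the scalar c standing in for the two lanes of xmm2, and the element count n are my own assumptions, not something taken from the question. It models where the loads and stores go, not the timing.

```c
#include <stddef.h>

/* Model of the first loop: every iteration loads and stores the same pair
   Y[i], Y[i+1], so no iteration needs data produced by another one. */
void loop_same_slot(double *y, const double *x, double c, size_t n) {
    for (size_t i = 0; i + 1 < n; i += 2) {
        double y0 = y[i] - c * x[i];
        double y1 = y[i + 1] - c * x[i + 1];
        y[i] = y0;
        y[i + 1] = y1;
    }
}

/* Model of the second loop: the index is bumped before the store, so the
   result lands in Y[i+2], Y[i+3], which the next iteration loads.
   The caller must provide n + 2 elements in y. */
void loop_next_slot(double *y, const double *x, double c, size_t n) {
    for (size_t i = 0; i + 1 < n; i += 2) {
        double y0 = y[i] - c * x[i];         /* reads the previous store */
        double y1 = y[i + 1] - c * x[i + 1];
        y[i + 2] = y0;                       /* feeds the next iteration */
        y[i + 3] = y1;
    }
}

/* Model of the suggested fix: the index is still bumped early, but the
   store address is adjusted back by one pair (the -16 in the assembly). */
void loop_fixed(double *y, const double *x, double c, size_t n) {
    size_t i = 0;
    while (i + 1 < n) {
        double y0 = y[i] - c * x[i];
        double y1 = y[i + 1] - c * x[i + 1];
        i += 2;                              /* add rax, 16 */
        y[i - 2] = y0;                       /* movapd [rdi+rax-16], xmm0 */
        y[i - 1] = y1;
    }
}
```

Note that loop_next_slot also computes something different from the other two, since each result is fed into the next iteration. In loop_fixed the stores go back to the slots that were loaded, so it should give the same values as loop_same_slot.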
