Gilgamesz

Reputation: 5073

Two loop-carried dependency chains. Which one matters?

My CPU is Ivy Bridge. Let's consider an example from Agner Fog's optimizing_assembly manual, chapter 12.6, example 12.10a:

    movsd xmm2, [x]   ; x
    movsd xmm1, [one] ; x^0 = 1.0
    xorps xmm0, xmm0  ; sum = 0.0
    mov eax, coeff    ; pointer to the coefficient array

    L1:
        movsd xmm3, [eax] ; c[i]
        mulsd xmm3, xmm1  ; c[i]*x^i
        mulsd xmm1, xmm2  ; x^(i+1)
        addsd xmm0, xmm3  ; sum += c[i]*x^i
        add   eax, 8
        cmp eax, coeff_end
        jb L1

The frontend is not a bottleneck (that is obvious, given the latency of multiplication).

We have two loop-carried dependency chains (I skipped add eax, 8):

1. mulsd xmm1, xmm2, latency of 5 cycles
2. addsd xmm0, xmm3, latency of 3 cycles

Btw, I have a problem deciding: should I sum them up (5 + 3 = 8 cycles) or take the greatest, i.e. 5 cycles?
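My reasoning so far for taking the greatest (a sketch of the two chains unrolled; this annotation is mine, not from Agner's manual):

    ; xmm1 chain:  mulsd(5c) -> mulsd(5c) -> mulsd(5c) -> ...   (x^i)
    ; xmm0 chain:  addsd(3c) -> addsd(3c) -> addsd(3c) -> ...   (sum)
    ;
    ; Each iteration's addsd reads xmm3 from that iteration's dead-end
    ; mulsd xmm3, xmm1, but nothing in the xmm0 chain ever feeds back
    ; into the xmm1 chain. The two chains run in parallel, so the
    ; steady-state bound should be max(5, 3) = 5 cycles per iteration,
    ; not 5 + 3 = 8.

So I expect 5 cycles per iteration.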

I've tested it on a 10,000,000-element array. It takes 6.7 cycles per iteration according to perf, and 5.9 cycles per iteration according to Agner's tool.

Please explain why it takes 6.7 cycles instead of just 5 cycles.

Upvotes: 0

Views: 98

Answers (1)

Peter Cordes

Reputation: 365637

Resource conflicts are probably sometimes delaying the loop-carried mulsd. Uop scheduling chooses oldest-first (out of uops that have their inputs ready), not critical-path-first, so the dead-end mulsd xmm3, xmm1 probably steals the execution port and occasionally delays the loop-carried mulsd xmm1, xmm2 by a cycle.
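One experiment you could try (my sketch, untested): make the loop-carried multiply the older uop in program order, so oldest-first scheduling picks it when both multiplies become ready in the same cycle. This costs one extra register copy, but movaps reg,reg is handled at rename (mov elimination) on Ivy Bridge, so it adds no latency:

    L1:
        movsd  xmm3, [eax]   ; c[i]
        movaps xmm4, xmm1    ; save x^i (mov-eliminated, 0c on IvB)
        mulsd  xmm1, xmm2    ; x^(i+1): loop-carried mul, now the oldest
        mulsd  xmm3, xmm4    ; c[i]*x^i (dead-end chain)
        addsd  xmm0, xmm3    ; sum += c[i]*x^i
        add    eax, 8
        cmp    eax, coeff_end
        jb     L1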

There are probably also occasional pipeline stalls from cache misses, since hardware prefetching doesn't work across page boundaries. Test for this by putting an outer repeat-loop around an inner loop over ~128 KiB of data (half the L2 cache size), as sketched below.
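A minimal sketch of that test (assuming coeff/coeff_end now bound a ~128 KiB buffer; the outer-loop count in ebx is arbitrary):

    mov   ebx, 10000         ; arbitrary repeat count for stable timing
    outer:
        movsd xmm1, [one]    ; reset x^0 = 1.0 for this pass
        mov   eax, coeff     ; restart at the beginning of the buffer
    L1:
        movsd xmm3, [eax]
        mulsd xmm3, xmm1
        mulsd xmm1, xmm2
        addsd xmm0, xmm3
        add   eax, 8
        cmp   eax, coeff_end ; coeff_end - coeff = ~128 KiB
        jb    L1
        dec   ebx
        jnz   outer

After the first pass the data is hot in L2, so any remaining per-iteration cost isn't coming from memory.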

Prefetch should have no problem keeping up with your loop normally, though: one 64-bit load per 5 cycles is about 1/5th of main memory bandwidth.
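Rough numbers behind that (assuming a ~3 GHz clock and ~25 GB/s of dual-channel DDR3 bandwidth; both figures are assumptions, not measurements from the question):

    8 bytes / 5 cycles * 3e9 cycles/s = 4.8 GB/s
    4.8 GB/s / 25 GB/s ~= 0.19, i.e. about 1/5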

You could also use float instead of double.
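For example (a direct single-precision translation of the loop above, assuming [one] now holds 1.0f; mulss latency is still 5 cycles on IvB, so this halves the memory traffic but doesn't shorten the loop-carried chain):

    movss xmm2, [x]          ; x
    movss xmm1, [one]        ; x^0 = 1.0f
    xorps xmm0, xmm0         ; sum = 0.0f
    mov   eax, coeff         ; pointer to float coefficients

    L1:
        movss xmm3, [eax]    ; c[i]
        mulss xmm3, xmm1     ; c[i]*x^i
        mulss xmm1, xmm2     ; x^(i+1)  (loop-carried)
        addss xmm0, xmm3     ; sum += c[i]*x^i
        add   eax, 4         ; float = 4 bytes
        cmp   eax, coeff_end
        jb    L1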


Does your data have any denormals? Those are slow (but NaN isn't, with SSE).

Upvotes: 1
