Reputation: 5073
My CPU is IvyBridge. Let's consider an example from Agner Fog's optimizing_assembly manual, chapter 12.6, example 12.10a:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax] ; c[i]
mulsd xmm3, xmm1 ; c[i]*x^i
mulsd xmm1, xmm2 ; x^(i+1)
addsd xmm0, xmm3 ; sum += c[i]*x^i
add eax, 8
cmp eax, coeff_end
jb L1
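For reference, here is a minimal C sketch of what the loop computes (the function and parameter names are my own, not from Agner's manual):

#include <stddef.h>

/* Evaluates sum of c[i]*x^i by keeping a running power of x,
 * mirroring the asm above: sum is xmm0, xpow is xmm1, x is xmm2. */
double poly_eval(const double *coeff, size_t n, double x)
{
    double sum = 0.0;   /* xorps xmm0, xmm0 */
    double xpow = 1.0;  /* movsd xmm1, [one] */
    for (size_t i = 0; i < n; i++) {
        sum += coeff[i] * xpow;  /* the 3-cycle addsd chain */
        xpow *= x;               /* the 5-cycle loop-carried mulsd chain */
    }
    return sum;
}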
The frontend is not a bottleneck (that's obvious given the latency of the multiplication).
We have two loop-carried dependency chains (I've skipped add eax, 8):
1. mulsd xmm1, xmm2 with a latency of 5 cycles
2. addsd xmm0, xmm3 with a latency of 3 cycles
Btw, I have a problem deciding: should I sum the latencies (5 + 3 = 8) or take the greatest, i.e. 5 cycles?
I've tested it on a 10,000,000-element array, and it takes 6.7 cycles per iteration according to perf, and 5.9 cycles per iteration according to Agner's tool.
Please explain why it takes 6.7 cycles instead of just 5 cycles.
Upvotes: 0
Views: 98
Reputation: 365637
The two chains are independent of each other (the addsd result never feeds back into the mulsd chain), so you take the greatest latency, not the sum: the expected cost is 5 cycles per iteration.
Probably resource conflicts are sometimes delaying the loop-carried mulsd beyond that. Uop scheduling chooses oldest-first (out of uops that have their inputs ready), not critical-path-first, so the mulsd xmm3, xmm1 probably steals the execution port and delays the loop-carried mulsd xmm1, xmm2 by a cycle occasionally.
There are probably also occasional pipeline stalls from cache misses, since hardware prefetching doesn't work across page boundaries. Test for this by putting an outer repeat-loop around an inner loop over ~128kiB of data (1/2 L2 cache size).
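A minimal sketch of such a test, assuming a C driver (the buffer size and repeat count are arbitrary choices):

#include <stddef.h>

/* Repeat-loop over a buffer half the size of IvyBridge's 256KiB L2,
 * so after the first pass the data is L2-resident and page-boundary
 * prefetch stalls disappear from the measurement. x should be close
 * to 1.0 so x^i neither overflows nor underflows into denormals
 * over 16K iterations. */
#define BUF_BYTES (128 * 1024)
#define N (BUF_BYTES / sizeof(double))
#define REPEATS 100000

double timing_test(const double *coeff, double x)
{
    double total = 0.0;
    for (int r = 0; r < REPEATS; r++) {
        double sum = 0.0, xpow = 1.0;
        for (size_t i = 0; i < N; i++) {
            sum += coeff[i] * xpow;
            xpow *= x;
        }
        total += sum;  /* keep the result live so nothing is optimized out */
    }
    return total;
}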
Prefetch should have no problem keeping up with your loop normally, though: one 64-bit load per 5 cycles is about 1/5th of main memory bandwidth.
You could also use float instead of double.
Does your data have any denormals? Those are slow (but NaN isn't, with SSE).
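If you're not sure, a quick check with C99's fpclassify would look something like this (the array name is an assumption):

#include <math.h>
#include <stddef.h>

/* Count subnormal (denormal) values in the input array. Note that
 * denormals can also appear as intermediates, e.g. if x^i underflows,
 * even when the input itself is clean. */
size_t count_denormals(const double *coeff, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (fpclassify(coeff[i]) == FP_SUBNORMAL)
            hits++;
    return hits;
}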
Upvotes: 1