x86: Long loop-carried dependency chain. Why 13 cycles?

Question

I modified the code from a previous experiment (Agner Fog's Optimizing Assembly, example 12.10a) to make it more dependent:

movsd xmm2, [x] 
movsd xmm1, [one] 
xorps xmm0, xmm0  
mov eax, coeff    

L1:
    movsd xmm3, [eax]
    mulsd xmm3, xmm1
    mulsd xmm1, xmm2
    addsd xmm1, xmm3
    add   eax, 8
    cmp eax, coeff_end
    jb L1

And now it takes ~13 cycles per iteration, but I have no idea why so much. Please help me understand.

(update) I'm sorry. Yes, definetely @Peter Cordes is right- it takes 9 cycles per iteration in fact. The misunderstanding is caused by myself. I missed two similar pieces of codes ( instructions swapped), the 13-cycles code is here:

movsd xmm2, [x] 
movsd xmm1, [one] 
xorps xmm0, xmm0  
mov eax, coeff    

L1:
    movsd xmm3, [eax]

    mulsd xmm1, xmm2
    mulsd xmm3, xmm1      
    addsd xmm1, xmm3
    add   eax, 8
    cmp eax, coeff_end
    jb L1

Peter Cordes · Accepted Answer

It runs at exactly one iteration per 9c for me, on a Core2 E6600, which is expected:

      movsd xmm3, [eax] ; independent, depends only on eax

A:    mulsd xmm3, xmm1  ; 5c: depends on xmm1:C from last iteration
B:    mulsd xmm1, xmm2  ; 5c: depends on xmm1:C from last iteration
C:    addsd xmm1, xmm3  ; 3c: depends on xmm1:B from THIS iteration (and xmm3:A from this iteration)

When xmm1:C is ready from iteration i, the next iteration can start calculating:

A: producing xmm3:A in 5c
B: producing xmm1:B in 5c (but there's a resource conflict; these multiplies can't both start in the same cycle in Core2 or IvyBridge, only Haswell and later)

Regardless of which one runs first, both have to finish before C can run. So the loop-carried dependency chain is 5 + 3 cycles, +1c for the resource conflict that stops both multiplies from starting in the same cycle.

Test code that runs at the expected speed:

This slows down to one iteration per ~11c when the array is 8B * 128 * 1024. If you're testing with an even bigger array instead of using a repeat-loop around what you posted, then that's why you're seeing a higher latency.

If a load arrives late, there's no way for the CPU to "catch up", since it delays the loop-carried dependency chain. If the load was only needed in a dependency chain that forked off from the loop-carried chain, then the pipeline could absorb an occasional slow load more easily. So, some loops can be more sensitive to memory delays than others.

        default REL
%macro  IACA_start 0
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro

global _start
_start:
        movsd   xmm2, [x]
        movsd   xmm1, [one]
        xorps   xmm0, xmm0
        mov     ecx, 10000

outer_loop:
        mov     eax, coeff
IACA_start                      ; outside the loop
ALIGN 32                        ; this matters on Core2, .78 insn per cycle vs. 0.63 without
L1:
        movsd   xmm3, [eax]
        mulsd   xmm3, xmm1
        mulsd   xmm1, xmm2
        addsd   xmm1, xmm3
        add     eax, 8
        cmp     eax, coeff_end
        jb      L1
IACA_end
        dec     ecx
        jnz     outer_loop

        ;mov    eax, 1
        ;int    0x80            ; exit() for 32bit code
        xor     edi, edi
        mov     eax, 231        ;  exit_group(0).  __NR_exit = 60.
        syscall


        section .data
x:
one:    dq 1.0

        section .bss
coeff:  resq 24*1024        ; 6 * L1 size.  Doesn't run any faster when it fits in L1 (resb)
coeff_end:

Experimental test

$ asm-link interiteration-test.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 interiteration-test.asm
+ ld -o interiteration-test interiteration-test.o

$ perf stat ./interiteration-test

 Performance counter stats for './interiteration-test':
    928.543744      task-clock (msec)         #    0.995 CPUs utilized          
           152      context-switches          #    0.164 K/sec                  
             1      cpu-migrations            #    0.001 K/sec                  
            52      page-faults               #    0.056 K/sec                  
 2,222,536,634      cycles                    #    2.394 GHz                      (50.14%)
       stalled-cycles-frontend  
       stalled-cycles-backend   
 1,723,575,954      instructions              #    0.78  insns per cycle          (75.06%)
   246,414,304      branches                  #  265.377 M/sec                    (75.16%)
        51,483      branch-misses             #    0.02% of all branches          (74.74%)

   0.933372495 seconds time elapsed

Each branch / every 7 instructions is one iteration of the inner loop.

$ bc -l
bc 1.06.95
1723575954 / 7
246225136.28571428571428571428
# ~= number of branches: good
2222536634 / .
9.026
# cycles per iteration

IACA agrees: 9c per iteration on IvB

(not counting the nops from ALIGN):

$ iaca.sh -arch IVB interiteration-test
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - interiteration-test
Binary Format - 64Bit
Architecture  - IVB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 9.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 2.0    0.0  | 1.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 2.0  |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | movsd xmm3, qword ptr [eax]
|   1    | 1.0       |     |           |           |     |     |    | mulsd xmm3, xmm1
|   1    | 1.0       |     |           |           |     |     | CP | mulsd xmm1, xmm2
|   1    |           | 1.0 |           |           |     |     | CP | addsd xmm1, xmm3
|   1    |           |     |           |           |     | 1.0 |    | add eax, 0x8
|   1    |           |     |           |           |     | 1.0 |    | cmp eax, 0x63011c
|   0F   |           |     |           |           |     |     |    | jb 0xffffffffffffffe7
Total Num Of Uops: 6

x86: Long loop-carried dependency chain. Why 13 cycles?

Answers (2)

Test code that runs at the expected speed:

Experimental test

IACA agrees: 9c per iteration on IvB

Related Questions