Unexpected slowdown from inserting a nop in a loop, and from reading near a movnti store

Question

I can't understand why the first loop below runs at ~1 cycle per iteration while the second takes 2 cycles per iteration. I measured with Agner Fog's tool and with perf. IACA says it should take 1 cycle per iteration, and my own calculations agree.

This takes 1 cycle per iteration.

; array is defined in section .data
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    add rcx, 1 
    cmp rcx, n
    jle .begin

And this takes 2 cycles per iteration. Why?

; array is defined in section .data
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    nop
    add rcx, 1 
    cmp rcx, n
    jle .begin

This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.

.begin:
    movnti [array], eax
    mov rbx, [array+16]
    add rcx, 1 
    cmp rcx, n
    jle .begin

My CPU is IvyBridge.

Peter Cordes · Accepted Answer

movnti is 2 uops, and can't micro-fuse, according to Agner Fog's tables for IvyBridge.

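So your first loop is 4 fused-domain uops (2 for movnti, 1 for add, and 1 for the macro-fused cmp/jle). IvyBridge can issue up to 4 fused-domain uops per clock, so the loop can issue at one iteration per clock.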

The nop is a 5th fused-domain uop (even though it doesn't need an execution port, so it's 0 unfused-domain uops). Five uops don't fit in a single 4-wide issue group, so the front end can only issue the loop at one iteration per 2 clocks.
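
To make the count concrete, here is the question's second loop as a complete program with the fused-domain uops annotated. This is only a sketch: the per-instruction split is inferred from the totals given here (with cmp/jle counted as one macro-fused uop), and the _start wrapper, the exit syscall, and the build command are assumptions for a Linux x86-64 NASM build, not part of the question.

```nasm
; Assumed build: nasm -felf64 nop_loop.asm && ld -o nop_loop nop_loop.o
%define n 1000000

section .data
array:  dd 0                    ; store target, as in the question

section .text
global _start
_start:
    xor ecx, ecx                ; zero the loop counter (zero-extends into rcx)
.begin:
    movnti [array], eax         ; 2 fused-domain uops (can't micro-fuse)
    nop                         ; 1 fused-domain uop, no execution port
    add rcx, 1                  ; 1 uop
    cmp rcx, n                  ; cmp + jle counted as 1 macro-fused uop
    jle .begin
                                ; total: 5 fused-domain uops per iteration

    mov eax, 60                 ; Linux exit(0): wrapper only, not part of the loop
    xor edi, edi
    syscall
```

Deleting the nop line brings the total back to 4, which matches the ~1 cycle per iteration measured for the first loop.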

See also the x86 tag wiki for more links to how CPUs work.

The 3rd loop is probably slow because mov rbx, [array+16] loads from the same cache line that movnti evicts. The eviction happens every time the fill buffer that movnti is storing into gets flushed. (That isn't on every movnti; apparently it can rewrite some bytes in the same fill buffer.)
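
One way to check this guess is to move the load to a different cache line and measure again. The sketch below assumes 64-byte cache lines and aligns array to 64 bytes so that array+64 cannot share a line with the store. The 128-byte buffer, the alignment, the _start wrapper, and the build command are assumptions added for illustration, not part of the question.

```nasm
; Assumed build: nasm -felf64 other_line.asm && ld -o other_line other_line.o
%define n 1000000

section .data align=64          ; assumption: 64-byte cache lines
array:  times 128 db 0          ; two full lines; array starts on a line boundary

section .text
global _start
_start:
    xor ecx, ecx                ; zero the loop counter (zero-extends into rcx)
.begin:
    movnti [array], eax         ; NT store into the first cache line
    mov rbx, [array+64]         ; load from the second line, not the one being evicted
    add rcx, 1
    cmp rcx, n
    jle .begin

    mov eax, 60                 ; Linux exit(0): wrapper only, not part of the loop
    xor edi, edi
    syscall
```

If the same-line explanation is right, this variant should come out far below ~27 cycles per iteration. It is still 5 fused-domain uops, so it should not beat 2 cycles per iteration.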
