Reputation:
This takes 1 cycle per iteration.
; array is defined in the data section
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
add rcx, 1
cmp rcx, n
jle .begin
And this takes 2 cycles per iteration. But why?
; array is defined in the data section
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
nop
add rcx, 1
cmp rcx, n
jle .begin
This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.
.begin:
movnti [array], eax
mov rbx, [array+16]
add rcx, 1
cmp rcx, n
jle .begin
My CPU is IvyBridge.
Upvotes: 4
Views: 85
Reputation: 365637
movnti is 2 uops and can't micro-fuse, according to Agner Fog's tables for IvyBridge.
So your first loop is 4 fused-domain uops (2 for movnti, 1 for add, and 1 for the macro-fused cmp/jle), and it can issue at one iteration per clock.
The nop is a 5th fused-domain uop (even though it doesn't take any execution ports, so it's 0 unfused-domain uops). The frontend issues at most 4 fused-domain uops per clock, so a 5-uop loop can only issue at one iteration per 2 clocks.
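To make the counting concrete, here is the second loop written out as a complete program with the fused-domain uop counts as comments. Treat it as a sketch under assumptions: Linux x86-64, assembled with NASM and linked with ld, with array in .data as in the question. The file name, the build line and the exit syscall are my additions, and the counts are the ones quoted above, not something I measured.

```nasm
; Assumed build: nasm -felf64 uops.asm && ld -o uops uops.o   (Linux x86-64)
default rel
%define n 1000000

section .data
array:  dd 0                    ; 4-byte destination for the NT store

section .text
global _start
_start:
    xor     ecx, ecx            ; loop counter = 0 (writing ecx zero-extends into rcx)
.begin:
    movnti  [array], eax        ; 2 fused-domain uops, cannot micro-fuse
    nop                         ; 1 fused-domain uop, 0 unfused-domain (no execution port)
    add     rcx, 1              ; 1 uop
    cmp     rcx, n              ; cmp + jle count as 1 uop after macro-fusion
    jle     .begin
    ; total: 5 fused-domain uops per iteration, one more than a 4-wide issue group

    mov     eax, 60             ; Linux exit syscall number (my addition)
    xor     edi, edi            ; exit status 0
    syscall
```

Something like perf stat -e cycles ./uops, with the cycle count divided by n, should land near 2 if the issue-width explanation holds. Deleting the nop brings the loop back to 4 uops.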
See also the x86 tag wiki for more links to how CPUs work.
The 3rd loop is probably slow because mov rbx, [array+16] is loading from the same cache line that movnti evicts. This happens every time the fill buffer it's storing into is flushed. (Not on every movnti; apparently it can rewrite some bytes in the same fill buffer.)
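One way to check the cache-line guess is to move the load to a different line and compare timings. The variant below is only a sketch of that experiment, and I have not measured it. It assumes Linux x86-64 with NASM and ld, a 64-byte cache line, and an array that is 64-byte aligned and at least 128 bytes long. The +64 offset, the array size and the exit syscall are my choices, not part of the question.

```nasm
; Assumed build: nasm -felf64 nt_line.asm && ld -o nt_line nt_line.o   (Linux x86-64)
default rel
%define n 1000000

section .data align=64
array:  times 128 db 0          ; two 64-byte lines: [array, array+64) and [array+64, array+128)

section .text
global _start
_start:
    xor     ecx, ecx            ; loop counter = 0
.begin:
    movnti  [array], eax        ; NT store into the first cache line
    mov     rbx, [array+64]     ; load from the second line (the question used array+16)
    add     rcx, 1
    cmp     rcx, n
    jle     .begin

    mov     eax, 60             ; Linux exit syscall number (my addition)
    xor     edi, edi            ; exit status 0
    syscall
```

If the explanation is right, cycles per iteration should drop well below the ~27 reported for the array+16 version. If they don't, the same-cache-line theory is probably wrong.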
Upvotes: 2