Mohammad Hedayati

Reputation: 251

Big difference in overhead caused by instructions in straight-line code

I am trying to understand the overhead in [blk_account_io_completion][1] in the Linux block layer. Using perf annotate I get the following snippet (abridged). Can someone shed some light on why the add and test instructions show such high overhead compared to the neighboring instructions that execute alongside them?

         :                      part_stat_add(cpu, part, sectors[rw], bytes >> 9);
    0.13 :        ffffffff813336eb:       movsxd r8,r8d
    0.00 :        ffffffff813336ee:       lea    rdx,[rax*8+0x0]
    0.00 :        ffffffff813336f6:       mov    rcx,QWORD PTR [rdi+0x210]
   72.04 :        ffffffff813336fd:       add    rcx,QWORD PTR [r8*8-0x7e2df6a0]
    0.22 :        ffffffff81333705:       add    QWORD PTR [rcx+rdx*1],rsi
    0.61 :        ffffffff81333709:       mov    eax,DWORD PTR [rdi+0x1f4]
   26.52 :        ffffffff8133370f:       test   eax,eax
    0.00 :        ffffffff81333711:       je     ffffffff81333733 <blk_account_io_completion+0x83>

Upvotes: 3

Views: 144

Answers (1)

Alexey Alexandrov

Reputation: 3129

One possible reason is that these instructions simply happen to be the ones the instruction pointer is pointing at when a sample is taken. A typical x86 CPU can retire up to 4 instructions per cycle, but when a sample is taken, the program counter points to just one instruction, not all four.

Here is an example, shown below: a simple loop with a bunch of nop instructions. Note how the clockticks are distributed across this profile, with exactly three instructions in each gap between the hot nops. This may be similar to the effect you are seeing.

Alternatively, it could be that mov rcx,QWORD PTR [rdi+0x210] and mov eax,DWORD PTR [rdi+0x1f4] often miss the cache, with the cycles spent on the miss being attributed to the next instruction.

       │    Disassembly of section .text:
       │
       │    00000000004004ed :
       │      push   %rbp
       │      mov    %rsp,%rbp
       │      movl   $0x0,-0x4(%rbp)
       │    ↓ jmp    25
 14.59 │ d:   nop
       │      nop
       │      nop
  0.03 │      nop
 14.58 │      nop
       │      nop
       │      nop
  0.08 │      nop
 13.89 │      nop
       │      nop
  0.01 │      nop
  0.08 │      nop
 13.99 │      nop
       │      nop
  0.01 │      nop
  0.05 │      nop
 13.92 │      nop
       │      nop
  0.01 │      nop
  0.07 │      nop
 14.44 │      addl   $0x1,-0x4(%rbp)
  0.33 │25:   cmpl   $0x3fffffff,-0x4(%rbp)
 13.90 │    ↑ jbe    d
       │      pop    %rbp
       │    ← retq
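
The source for this loop was not posted, so the following is only a guess at what it might have looked like: a minimal C sketch, assuming GCC or Clang inline asm. The function name nop_loop and its last parameter are hypothetical, introduced so the loop bound can be varied. The 20 nops and the unsigned compare (jbe) follow the listing above, but the generated code will not be byte-identical.

```c
/* Hypothetical reconstruction of the profiled loop; the original source
 * is not shown in the answer. Assumes GCC/Clang inline asm syntax. */

/* Five nop instructions, one per line of emitted assembly. */
#define NOP5 "nop\n\tnop\n\tnop\n\tnop\n\tnop\n\t"

/* Runs the 20-nop loop body for i = 0..last inclusive and returns the
 * number of iterations executed. Do not pass UINT_MAX: i <= last would
 * then always hold and the loop would never end. */
unsigned int nop_loop(unsigned int last)
{
    unsigned int i;

    for (i = 0; i <= last; i++) {
        /* volatile keeps the compiler from dropping or merging the nops */
        __asm__ volatile(NOP5 NOP5 NOP5 NOP5);
    }
    return i;
}
```

Calling nop_loop(0x3fffffff) from main, building without optimization (for example gcc -O0), and running the binary under perf record followed by perf annotate should produce a profile of the same general shape, though the exact percentages and which nops collect the samples will depend on the CPU.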

Upvotes: 3
