Reputation: 251
I am trying to understand the overhead in [blk_account_io_completion][1] in the Linux block layer. Using `perf annotate` I get the following snippet (abridged). Can someone shed some light on why the `add` and `test` instructions show such high overheads compared to the neighboring instructions that execute together with them?
: part_stat_add(cpu, part, sectors[rw], bytes >> 9);
0.13 : ffffffff813336eb: movsxd r8,r8d
0.00 : ffffffff813336ee: lea rdx,[rax*8+0x0]
0.00 : ffffffff813336f6: mov rcx,QWORD PTR [rdi+0x210]
72.04 : ffffffff813336fd: add rcx,QWORD PTR [r8*8-0x7e2df6a0]
0.22 : ffffffff81333705: add QWORD PTR [rcx+rdx*1],rsi
0.61 : ffffffff81333709: mov eax,DWORD PTR [rdi+0x1f4]
26.52 : ffffffff8133370f: test eax,eax
0.00 : ffffffff81333711: je ffffffff81333733 <blk_account_io_completion+0x83>
Upvotes: 3
Views: 144
Reputation: 3129
One possible reason is that these instructions happen to be the ones pointed to by the instruction pointer when a sample is taken. A typical x86 CPU can retire up to 4 instructions per cycle, but when it does so and a sample is taken, the program counter points to just one of those four instructions, not all of them.
Here is an example - see the annotated listing below: a simple, plain loop with a bunch of `nop` instructions. Note how the clock ticks are distributed over this profile, with exactly three nearly unsampled instructions in each gap between the hot ones. This may be similar to the effect you are seeing.
Alternatively, it could be that `mov rcx,QWORD PTR [rdi+0x210]` and `mov eax,DWORD PTR [rdi+0x1f4]` often miss the cache, with the cycles spent on the miss being attributed to the next instruction, as described here.
      │    Disassembly of section .text:
      │
      │    00000000004004ed :
      │      push   %rbp
      │      mov    %rsp,%rbp
      │      movl   $0x0,-0x4(%rbp)
      │    ↓ jmp    25
14.59 │ d:   nop
      │      nop
      │      nop
 0.03 │      nop
14.58 │      nop
      │      nop
      │      nop
 0.08 │      nop
13.89 │      nop
      │      nop
 0.01 │      nop
 0.08 │      nop
13.99 │      nop
      │      nop
 0.01 │      nop
 0.05 │      nop
13.92 │      nop
      │      nop
 0.01 │      nop
 0.07 │      nop
14.44 │      addl   $0x1,-0x4(%rbp)
 0.33 │25:   cmpl   $0x3fffffff,-0x4(%rbp)
13.90 │    ↑ jbe    d
      │      pop    %rbp
      │    ← retq
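For reference, here is a minimal sketch of the kind of test program that produces a profile like the one above. The loop bound matches the `cmpl` in the listing, but the number of `nop`s, the function layout and the build/record commands are my assumptions, not something taken from the original post:

```c
/* nops.c - hypothetical reconstruction of the loop profiled above.
 *
 * Build without optimization so the counter stays on the stack,
 * then sample and annotate, e.g.:
 *
 *     gcc -O0 -o nops nops.c
 *     perf record ./nops
 *     perf annotate
 */
int main(void)
{
    /* A run of back-to-back nops inside a counted loop; with 4-wide
     * retirement the sampled instruction pointer tends to land on every
     * fourth instruction, giving the ~14% / ~0% pattern in the profile. */
    for (unsigned i = 0; i <= 0x3fffffff; i++)
        __asm__ volatile(
            "nop; nop; nop; nop; nop;"
            "nop; nop; nop; nop; nop;"
            "nop; nop; nop; nop; nop;"
            "nop; nop; nop; nop; nop;"
        );
    return 0;
}
```

The exact percentages will differ between CPUs, but a similar stride pattern should be visible.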
Upvotes: 3