The Vee

Reputation: 11570

What's the reason for padding executable sections with "long NOPs"?

I found that x86-64 programs (at least those compiled using GCC) have functions start by default at addresses aligned to multiples of 16 bytes and that the padding is done by NOP instructions with as many prefixes as could fit to optimally fill the space. For example,

  (...)
  447454:   c3                              retq   
  447455:   90                              nop
  447456:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)

0000000000447460 <__libc_csu_fini>:
  447460:   f3 c3                           repz retq 

What's the advantage over filling the space with regular single-byte NOPs, as observed here or here?

Upvotes: 4

Views: 920

Answers (1)

Peter Cordes

Reputation: 364522

There's no downside, so why not? It makes the disassembly easier for humans to read, because you don't have a huge number of lines separating functions.

GCC (the actual compiler part that transforms C to assembly) uses the same .p2align directive to ask the assembler to insert padding whether it's inside a function to align branch targets, or whether it's between functions to align function entry points.

GCC could emit .p2align 4,,0x90 to ask the assembler to fill with single-byte NOPs in cases where the NOPs won't be executed, but like I said, there's no reason to bother doing that instead of .p2align 4 (pad out to the next 2^4 boundary with the default choice of filler).
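To put numbers on that for the listing in the question, here is a minimal Python sketch of the arithmetic. The helper names are hypothetical, and the greedy split is only an illustration: a real assembler picks its own NOP pattern (the listing shows a 1-byte NOP followed by a 10-byte one). The 10-byte maximum is taken from the `nopw` in the question, not from any assembler documentation.

```python
def padding_to_boundary(addr: int, p2align: int) -> int:
    """Bytes needed to advance addr to the next multiple of 2**p2align."""
    return -addr & ((1 << p2align) - 1)


def split_into_nops(pad: int, max_nop_len: int) -> list:
    """Greedily split pad bytes into NOP lengths of at most max_nop_len.

    Illustration only: not the pattern any particular assembler emits.
    """
    lengths = []
    while pad > 0:
        n = min(pad, max_nop_len)
        lengths.append(n)
        pad -= n
    return lengths


# 0x447455 is the address right after the retq in the question's listing.
pad = padding_to_boundary(0x447455, 4)
long_fill = split_into_nops(pad, max_nop_len=10)  # default .p2align 4 style fill
short_fill = split_into_nops(pad, max_nop_len=1)  # .p2align 4,,0x90 style fill

print(pad, len(long_fill), len(short_fill))  # prints: 11 2 11
```

So the same 11 bytes of padding show up as 2 lines of disassembly with long NOPs, versus 11 lines with single-byte NOPs.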


If the end of the function is an indirect branch (a tail-call with jmp [rax] or something), speculative execution could run into these NOP instructions. Decoding many short NOPs could overflow the uop cache on Intel SnB-family, which allows at most 3 cache lines of up-to-6 uops each per 32-byte block of code (see the microarch PDF at http://agner.org/optimize/). Long NOPs are potentially better for that.
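As a rough illustration of that limit, here is a small Python sketch of the arithmetic. It assumes each NOP costs one uop and ignores the uop cache's other packing rules, so treat it as a back-of-the-envelope check rather than a model of the hardware. The 5 uops of "real code" sharing the block are a made-up figure.

```python
# Limits quoted in the text above: 3 lines of up to 6 uops per 32-byte block.
UOP_CACHE_LINES_PER_BLOCK = 3
UOPS_PER_LINE = 6


def fits_in_uop_cache(uops_in_block: int) -> bool:
    """True if a 32-byte block with this many uops stays within 3 x 6 uops."""
    return uops_in_block <= UOP_CACHE_LINES_PER_BLOCK * UOPS_PER_LINE


# Hypothetical block: 5 uops of real code, then 15 bytes of padding.
real_uops = 5
with_short_nops = real_uops + 15  # fifteen 1-byte NOPs, assumed 1 uop each
with_long_nops = real_uops + 2    # e.g. one 10-byte NOP plus one 5-byte NOP

print(fits_in_uop_cache(with_short_nops))  # prints: False (20 > 18)
print(fits_in_uop_cache(with_long_nops))   # prints: True  (7 <= 18)
```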

IDK how Pentium4's trace cache builder behaved; maybe it was useful for that, too? Again, fewer longer NOP instructions are less likely to trigger anything weird in the front-end of a CPU before it figures out that the NOPs aren't executed.

MSVC pads with int3 between functions, IIRC, which will stop speculative execution. That's not a bad idea.

This is guesswork; it's probably not a real factor in performance; if it still mattered on modern CPUs, all compilers would probably avoid short NOPs between functions, but as one of your links showed, not all do.

Some CPUs, like AMD K8/K10 and Bulldozer-family, mark instruction lengths in L1I cache. Agner Fog says that bandwidth from L2 to L1I is low on K8/K10, and guesses that it may be from adding extra pre-decode information. IDK if this takes longer when there are lots of small instructions? It would have to know where to start decoding, because an instruction can span a cache-line boundary, so a line can begin in the middle of one. IDK how that works.


BTW, these instructions might be decoded as part of a group containing a normal ret, but I don't think there's anything to worry about either way in that case.

Decoding happens in 2 stages in some CPUs: first, instruction-length decoding, which finds blocks of up-to-16 bytes containing up-to-4 instructions (e.g. on Intel P6-family / Sandybridge-family). Then it feeds those blocks to the decoders.

With correct branch prediction for the ret, even nasty stuff like LCP (length-changing prefix) stalls after the ret doesn't seem to hurt.

Anyway, I don't think this difference is significant. Decoded NOP instructions after a RET should be cancelled before they go anywhere, because the RET is an unconditional branch. It probably makes no difference whether the instruction-length decoder finds many single-byte instructions vs. some prefixes but not the end of an instruction before the end of a 16-byte window.

Upvotes: 4
