Reputation: 4782
I'm aligning branch targets with NOPs, and sometimes the CPU executes these NOPs, up to 15 of them. How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, such as AMD's? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs? I want to know whether the extra code size and extra execution time of adding these NOPs are worth the cost. It's not me adding these NOPs; the assembler inserts them automatically whenever I write an align directive.
Update: I have managed to get the assembler to insert multi-byte NOPs automatically.
Upvotes: 8
Views: 1699
Reputation: 64905
Skylake can generally execute four single-byte nops in one cycle. This has been true at least back to the Sandy Bridge (hereafter SnB) micro-architecture.
Skylake, and others back to SnB, will also generally be able to execute four longer-than-one-byte NOPs in one cycle as well, unless they are so long as to run into front-end limitations.
The existing answers are much more complete and explain why you might not want to use such single-byte NOP instructions, so I won't add more; but it's nice to have one answer that just answers the headline question clearly, I think.
Upvotes: 4
Reputation: 364128
See also Cody's answer for lots of good stuff I'm leaving out because he covered it already.
Never use multiple 1-byte NOPs. All assemblers have ways to get long NOPs; see below.
15 NOPs take 3.75c to issue at the usual 4 per clock, but might not slow down your code at all if it was bottlenecked on a long dependency chain at that point. They do take up space in the ROB all the way until retirement. The only thing they don't do is use an execution port. The point is, CPU performance isn't additive. You can't just say "this takes 5 cycles and this takes 3, so together they will take 8". The point of out-of-order execution is to overlap with surrounding code.
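To make that arithmetic concrete, here is a back-of-the-envelope sketch (Python, since this is just counting; the function name is mine, not anything from a real tool):

```python
import math

def nop_issue_cycles(num_nops: int, issue_width: int = 4) -> int:
    """Lower bound on the cycles needed just to issue a run of NOPs on a
    CPU that issues `issue_width` uops per clock (4 on SnB-family).
    The real cost can be hidden entirely if surrounding code is
    bottlenecked on a dependency chain, since out-of-order execution
    overlaps the NOPs with useful work."""
    return math.ceil(num_nops / issue_width)

print(nop_issue_cycles(15))  # 15 NOPs / 4 per clock -> 4 cycles (3.75 rounded up)
```

This is only an issue-bandwidth bound, not a prediction: it is the cost you pay when the front-end is the bottleneck, and zero when it isn't.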
The worst effect of many 1-byte short NOPs on SnB-family is that they tend to overflow the uop-cache limit of 3 lines per aligned 32B chunk of x86 code. That would mean the whole 32B block always has to run from the decoders, not the uop cache or loop buffer. (The loop buffer only works for loops that have all their uops in the uop cache.)
You should only ever have at most 2 NOPs in a row that actually execute, and then only if you need to pad by more than 10B or 15B or something. (Some CPUs do very badly when decoding instructions with very many prefixes, so for NOPs that actually execute it's probably best not to repeat prefixes out to 15B, the max x86 instruction length.)
YASM defaults to making long NOPs. For NASM, use the smartalign standard macro package, which isn't enabled by default. It forces you to pick a NOP strategy.
%use smartalign
ALIGNMODE p6, 32 ; p6 NOP strategy, and jump over the NOPs only if they're 32B or larger.
IDK if 32 is optimal. Also, beware that the longest NOPs might use a lot of prefixes and decode slowly on Silvermont, or on AMD. Check the NASM manual for other modes.
The GNU assembler's .p2align directive gives you some conditional behaviour: .p2align 4,,10 will align to 16 (1<<4), but only if that skips 10 bytes or fewer. (The empty 2nd arg means the filler is NOPs, and the power-of-2 align name is because plain .align is power-of-2 on some platforms but byte-count on others.) gcc often emits this before the top of loops:
.p2align 4,,10
.p2align 3
.L7:
So you always get 8-byte alignment (unconditional .p2align 3), but maybe also 16 unless that would waste more than 10B. Putting the larger alignment first is important to avoid getting e.g. a 1-byte NOP and then an 8-byte NOP instead of a single 9-byte NOP.
It's probably possible to implement this functionality with a NASM macro.
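The conditional logic of that .p2align pair is simple enough to model in a few lines (a Python sketch; the function name is mine):

```python
def p2align_pair(addr: int, max_skip: int = 10) -> int:
    """Model of `.p2align 4,,10` followed by `.p2align 3`:
    align to 16 only if that skips at most `max_skip` bytes,
    then unconditionally align to 8."""
    skip16 = -addr % 16
    if skip16 <= max_skip:
        return addr + skip16
    return addr + (-addr % 8)

# From 0x09, aligning to 16 skips only 7 bytes, so we land on 0x10;
# from 0x01 it would skip 15 > 10 bytes, so we settle for 8.
print(hex(p2align_pair(0x09)), hex(p2align_pair(0x01)))
```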
Missing features no assembler has (AFAIK):
It's a good thing alignment for decode bottlenecks isn't usually very important anymore, because tweaking it usually involves manual assemble/disassemble/edit cycles, and has to get looked at again if the preceding code changes.
Especially if you have the luxury of tuning for a limited set of CPUs, test and don't pad if you don't find a perf benefit. In a lot of cases, especially for CPUs with a uop cache and/or loop buffer, it's ok not to align branch targets within functions, even loops.
Some of the performance-variation due to varying alignment is that it makes different branches alias each other in the branch-prediction caches. This secondary subtle effect is still present even when the uop cache works perfectly and there are no front-end bottlenecks from fetching mostly-empty lines from the uop cache.
See also Performance optimisations of x86-64 assembly - Alignment and branch prediction
Upvotes: 4
Reputation: 244722
It's not me adding these NOPs but the assembler (BASM). It is pretty dumb and does not support options for alignment; there is just one option: boundary size.
I don't know what "BASM" is, and I can't find any reference to it online (except this, which obviously isn't x86), but if it doesn't support multi-byte NOPs, you really need a different assembler. This is just really basic stuff that's been in the Intel and AMD architecture manuals for years. The GNU assembler can do this for ALIGN directives, and so can Microsoft's MASM. The open-source NASM and YASM assemblers support this as well, and either of these can be integrated into any existing build system easily.
By multi-byte NOPs, I mean the following, which you can find in AMD and Intel processor manuals:
Length | Mnemonic | Opcode Bytes
---------|-------------------------------------------|-------------------------------------
1 byte | NOP | 90
2 bytes | 66 NOP | 66 90
3 bytes | NOP DWORD [EAX] | 0F 1F 00
4 bytes | NOP DWORD [EAX + 00H] | 0F 1F 40 00
5 bytes | NOP DWORD [EAX + EAX*1 + 00H] | 0F 1F 44 00 00
6 bytes | 66 NOP DWORD [EAX + EAX*1 + 00H] | 66 0F 1F 44 00 00
7 bytes | NOP DWORD [EAX + 00000000H] | 0F 1F 80 00 00 00 00
8 bytes | NOP DWORD [EAX + EAX*1 + 00000000H] | 0F 1F 84 00 00 00 00 00
9 bytes | 66 NOP DWORD [EAX + EAX*1 + 00000000H] | 66 0F 1F 84 00 00 00 00 00
The sequence recommendations offered by the two manufacturers diverge slightly after 9 bytes, but NOPs that long are…not terribly common. And probably don't matter very much, since the extremely long NOP instructions with the excessive number of prefixes are going to degrade performance anyway. These work all the way back to the Pentium Pro, so they are universally supported today.
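A tiny helper makes it concrete how padding could be stitched together from these encodings. This is a sketch of mine, not any particular assembler's algorithm; the byte strings come straight from the table above:

```python
# Recommended multi-byte NOP encodings (from the Intel/AMD table above),
# keyed by length in bytes.
NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}

def nop_pad(n: int) -> bytes:
    """Emit n bytes of padding, greedily preferring the longest
    (up to 9-byte) NOP so fewer instructions execute."""
    out = bytearray()
    while n > 0:
        chunk = min(n, 9)
        out += NOPS[chunk]
        n -= chunk
    return bytes(out)
```

So a 15-byte pad becomes a 9-byte NOP plus a 6-byte NOP: two instructions instead of fifteen.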
Agner Fog has this to say about multi-byte NOPs:
The multi-byte NOP instruction has the opcode 0F 1F + a dummy memory operand. The length of the multi-byte NOP instruction can be adjusted by optionally adding 1 or 4 bytes of displacement and a SIB byte to the dummy memory operand and by adding one or more 66H prefixes. An excessive number of prefixes can cause delay on older microprocessors, but at least two prefixes is acceptable on most processors. NOPs of any length up to 10 bytes can be constructed in this way with no more than two prefixes. If the processor can handle multiple prefixes without penalty then the length can be up to 15 bytes.
All of the redundant/superfluous prefixes are simply ignored. The advantage, of course, is that most newer processors decode multi-byte NOPs efficiently, so they will be faster than a series of 1-byte NOP (0x90) instructions.
Perhaps even better than multi-byte NOPs for alignment is using longer forms of the instructions that you're already using in your code. These lengthier encodings don't take any longer to execute (they affect only decode bandwidth), so they're faster/cheaper than NOPs. Examples of this are:
- longer encodings of INC, DEC, PUSH, POP, etc., instead of the short versions
- ADD instead of INC
- LEA instead of MOV

Agner Fog's manuals speak at length about and give examples of these techniques, as well.
I don't know of any assemblers that will do these conversions/optimizations for you automatically (assemblers pick the shortest version, for obvious reasons), but they usually have a strict mode where you can force a particular encoding to be used, or you can just manually emit the instruction bytes. You only do this in highly performance-sensitive code anyway, where the work will actually pay off, so that limits the scope of the effort required substantially.
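As a concrete illustration of "same instruction, different length": `add eax, 1` has at least three valid encodings, per the Intel/AMD opcode maps. The variable names in this snippet are mine; it just spells out the byte sequences:

```python
# Three valid encodings of `add eax, 1` (opcode forms from the x86 manuals):
short_form = bytes.fromhex("83c001")        # 83 /0 ib : add r/m32, imm8  -> 3 bytes
eax_form   = bytes.fromhex("0501000000")    # 05 id    : add eax, imm32   -> 5 bytes
long_form  = bytes.fromhex("81c001000000")  # 81 /0 id : add r/m32, imm32 -> 6 bytes

# An assembler always picks short_form; forcing one of the longer forms
# pads the code stream without adding an executed instruction.
print(len(short_form), len(eax_form), len(long_form))
```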
I want to know whether the extra code size and extra execution time of adding these NOPs are worth the cost.
In general, no. While data alignment is extremely important and essentially free (size of the binary notwithstanding), code alignment is a lot less important. There are cases in tight loops where it can make a significant difference, but this only matters in hot spots in your code, which your profiler will already be identifying, and then you can perform the manipulations to manually align the code if necessary. Otherwise, I wouldn't worry about it.
It makes sense to align functions, as the padding bytes between them are never executed (rather than using NOPs here, you'll often see INT 3 or an invalid instruction, like UD2), but I wouldn't go around aligning all of your branch targets within functions simply as a matter of course. Only do it in known critical inner loops.
As ever, Agner Fog talks about this, and says it better than I could:
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small. We may align subroutine entries by the cache line size (typically 64 bytes) if the subroutine is part of a critical hot spot and the preceding code is unlikely to be executed in the same context.
A disadvantage of code alignment is that some cache space is lost to empty spaces before the aligned code entries.
In most cases, the effect of code alignment is minimal. So my recommendation is to align code only in the most critical cases like critical subroutines and critical innermost loops.
Aligning a subroutine entry is as simple as putting as many NOP's as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired. The assembler does this with the ALIGN directive. The NOP's that are inserted will not slow down the performance because they are never executed.

It is more problematic to align a loop entry because the preceding code is also executed. It may require up to 15 NOP's to align a loop entry by 16. These NOP's will be executed before the loop is entered and this will cost processor time. It is more efficient to use longer instructions that do nothing than to use a lot of single-byte NOP's. The best modern assemblers will do just that and use instructions like MOV EAX,EAX and LEA EBX,[EBX+00000000H] to fill the space before an ALIGN nn statement. The LEA instruction is particularly flexible. It is possible to give an instruction like LEA EBX,[EBX] any length from 2 to 8 by variously adding a SIB byte, a segment prefix and an offset of one or four bytes of zero. Don't use a two-byte offset in 32-bit mode as this will slow down decoding. And don't use more than one prefix because this will slow down decoding on older Intel processors.

Using pseudo-NOPs such as MOV RAX,RAX and LEA RBX,[RBX+0] as fillers has the disadvantage that it has a false dependence on the register, and it uses execution resources. It is better to use the multi-byte NOP instruction which can be adjusted to the desired length. The multi-byte NOP instruction is available in all processors that support conditional move instructions, i.e. Intel PPro, P2, AMD Athlon, K7 and later.

An alternative way of aligning a loop entry is to code the preceding instructions in ways that are longer than necessary. In most cases, this will not add to the execution time, but possibly to the instruction fetch time.
He also goes on to show an example of another way to align an inner loop by moving the preceding subroutine entry. This is kind of awkward, and requires some manual adjustment even in the best of assemblers, but it may be the optimal mechanism. Again, this only matters in critical inner loops on the hot path, where you're probably already digging in and micro-optimizing anyway.
Anecdotally, I've benchmarked code that I was in the middle of optimizing several times, and didn't find much, if any, benefit to aligning a loop branch target. For example, I was writing an optimized strlen function (the GNU libraries have one, but Microsoft's don't), and tried aligning the target of the main inner loop on 8-byte, 16-byte, and 32-byte boundaries. None of these made much of a difference, especially not when compared to the other drastic performance headway that I was making in rewriting the code.
And beware that if you're not optimizing for a specific processor, you can make yourself crazy trying to find the best "generic" code. When it comes to the effect of alignment on speed, things can vary wildly. A poor alignment strategy is often worse than no alignment strategy at all.
A power-of-two boundary is always a good idea, but this is pretty readily achieved without any extra effort. Again, don't dismiss alignment out of hand, because it can matter; but by the same token, don't obsess over trying to align every branch target.
Alignment used to be a bigger deal on the Core 2 (including Penryn) and Nehalem microarchitectures, where substantial decode bottlenecks meant that, despite a 4-wide issue width, you had a hard time keeping the execution units busy. With the introduction of the µop cache in Sandy Bridge (one of the few nice features of the Pentium 4 that was ultimately reintroduced into the P6 extended family), front-end throughput was increased pretty significantly, and this became a lot less of a problem.
Frankly, compilers aren't very good at making these types of optimizations, either. The -O2 switch for GCC implies the -falign-functions, -falign-jumps, -falign-loops, and -falign-labels switches, with a default preference to align on 8-byte boundaries. This is a pretty blunt approach, and mileage varies. As I linked above, reports vary about whether disabling this alignment and going for compact code might actually increase performance. Moreover, about the best you're going to see a compiler do is insert multi-byte NOPs. I haven't seen one that uses longer forms of instructions or drastically rearranges code for alignment purposes. So we still have a long way to go, and it's a very difficult problem to solve. Some people are working on it, but that just goes to show how intractable the problem really is: "Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness." (Note that, while interesting, that paper hails from the early Core 2 days, which suffered more than most from misalignment penalties, as I mentioned earlier. I'm not sure whether you'd see the same drastic improvements on today's microarchitectures, and I can't say for sure either way, because I haven't run the test. Maybe Google will hire me and I can publish another paper?)
How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs?
Questions like this can be answered by looking at Agner Fog's instruction tables and searching for NOP. I won't bother extracting all of his data into this answer.
In general, though, just know that NOPs are not free. Although they don't require an execution unit/port, they still have to run through the pipeline like any other instruction, and so they are ultimately bottlenecked by the issue (and/or retirement) width of the processor. This generally means you can execute somewhere between 3 and 5 NOPs per clock.
NOPs also still take up space in the µop cache, which means reduced code density and cache efficiency.
In many ways, you can think of a NOP as being equivalent to a XOR reg, reg or a MOV that gets elided in the front-end due to register renaming.
Upvotes: 6