Reputation: 333
As a small recall, the x86 architecture defines 0x0F 0x1F [mod R/M]
as a multi-byte NOP.
Now I'm looking at the specific case of an 8-byte NOP: I have got
0x0F 0x1F 0x84 0x__ 0x__ 0x__ 0x__ 0x__
where the last 5 bytes have got arbitrary values.
The third byte, [mod R/M]
, split up gives:
mod = 10b
: argument is reg1
+ a DWORD-sized displacementreg2 = 000b
: (we don't care)reg1 = 100b
: indicates that the argument is instead the SIB
byte + a DWORD-sized displacement.Now, as a concrete example, if I take
0x0F 0x1F 0x84 0x12 0x34 0x56 0x78 0x9A
I've got
SIB = 0x12
displacement = 0x9A785634
: a DWORDNow I add the 0x66
instruction prefix to indicate that the displacement should be a WORD instead of a DWORD:
0x66 0x0F 0x1F 0x84 0x12 0x34 0x56 0x78 0x9A
I expect 0x78 0x9A
to be 'cut off' and be treated as a new instruction. However, when compiling this and running objdump
on the resulting executable, it still uses all 4 bytes (a DWORD) as displacement.
Am I misunderstanding the meaning of 'displacement' in this context? Or does the 0x66
prefix not have any effect on multi-byte NOP instructions?
Upvotes: 7
Views: 7432
Reputation: 76547
The 66H
prefix overrides the size of the operand to 16 bit.
It does not override the size of the address, if you want that you use 67H
. (But don't in 32-bit code; that tends to cause an LCP stall, especially if you use nopw [bx+0x1111]
with a disp16 instead of the disp32 you'd get from the same ModRM encoding without a 67H
prefix.)
Here's a list of all traditional x86 prefixes (not REX or VEX/EVEX).
F0h = LOCK -- locks memory reads/writes
String prefixes
F3h = REP, REPE
F2h = REPNE
Segment overrides
2Eh = CS
36h = SS
3Eh = DS
26h = ES
64h = FS
65h = GS
Operand override
66h. Changes size of data expected to 16-bit
Address override
67h. Changes size of address expected to 16-bit (in 32-bit mode)
However it is best not to create your own NOP instructions, but stick to the recommended (multi-byte) NOPs.
AMD recommend the following:
Table 4-9. Recommended Multi-Byte Sequence of NOP Instruction
bytes sequence encoding
1 90H NOP
2 66 90H 66 NOP
3 0F 1F 00H NOP DWORD ptr [EAX]
4 0F 1F 40 00H NOP DWORD ptr [EAX + 00H]
5 0F 1F 44 00 00H NOP DWORD ptr [EAX + EAX*1 + 00H]
6 66 0F 1F 44 00 00H NOP DWORD ptr [AX + AX*1 + 00H]
7 0F 1F 80 00 00 00 00H NOP DWORD ptr [EAX + 00000000H]
8 0F 1F 84 00 00 00 00 00H NOP DWORD ptr [AX + AX*1 + 00000000H]
9 66 0F 1F 84 00 00 00 00 00H NOP DWORD ptr [AX + AX*1 + 00000000H]
(The disassembly in this table is wrong: it shows [AX]
in the instructions that use a 66H
prefix. That prefix sets the operand-size to 16, but the address-size is unmodified. And AX isn't encodeable in a 16-bit addressing mode if this was using a 67H
prefix in 32-bit mode. And 16-bit address size would mean a 2-byte displacement not 4-byte, and removes the possibility of a SIB byte. This is part of why 67H
is slow to decode in 32-bit mode on Intel CPUs, giving false LCP stalls.)
Intel does not mind up to 3 redundant prefixes, so nop's up to 11 bytes can be constructed like so.
10 66 66 0F 1F 84 00 00 00 00 00H NOP DWORD ptr [AX + AX*1 + 00000000H]
11 66 66 66 0F 1F 84 00 00 00 00 00H NOP DWORD ptr [AX + AX*1 + 00000000H]
You can also eliminate NOPs by prefixing normal instructions with redundant prefixes. e.g.
rep mov reg,reg //one extra byte
rep
is normally ignored, but new CPU extension often use rep
to make an old opcode mean something different. A safer choice is a prefix that can be valid for that opcode, but has no effect on register operands, like a segment override. Or like a DS
prefix when that's already the default, or in 64-bit mode where CS/DS/ES/SS bases are all 0.
Or choosing registers that need a REX prefix, so the assembler must use longer versions of the same instruction.
test r8d,r8d is one byte longer than: test edx,edx
The instructions with immediate operands have short and long versions (except for test
).
and edx,7 //short imm8
and edx,0000007 //long imm32
Most assembler will helpfully shorten all instructions for you, so you'll have to code the longer instructions yourself using db
, or NASM/YASM and edx, strict dword 7
. See What methods can be used to efficiently extend instruction length on modern x86? for more about this in general.
Interspersing these in strategic locations can help you align jump targets without having to incur delays due to the decoding or execution of a NOP.
Remember on most CPUs executing NOPs still uses up resources. Front-end decode / uop-cache / issue slots, and tracking it in the ROB until retirement. Padding other instructions takes up the same extra I-cache space, but without those other costs.
These techniques of extending instructions sometimes get used automatically, when you enable workarounds for the performance pot-hole introduced by Intel's microcode workaround for the JCC erratum in Skylake: see How can I mitigate the impact of the Intel jcc erratum on gcc? This can require extending instructions inside inner loops, where you really don't want to use a NOP, so Intel recommended extending earlier instructions.
Upvotes: 8