Reputation: 1971

x86 instruction prefix decoding

I'm developing a disassembler for x86 and x86-64, and I have two questions about how instruction prefixes are decoded:

For the following stream:
```
\x9b\x9b\xd9\x30
```
GCC and objdump output
```
fstenv [eax]
```
So they first read all prefixes (no more than 15) and then select the instruction by combining the last prefix read, \x9b, with \xd9, which gives fstenv.

Capstone on the other hand outputs
```
wait
wait
fnstenv dword ptr [eax] 
```
Now, obviously Capstone is wrong to emit two wait instructions rather than one. But should it emit wait instructions at all, or are GCC and objdump right to consume all the extra \x9b prefixes as part of the fstenv instruction?

For the following stream:
```
\xf2\x66\x0f\x12\x00
```
GCC and objdump output
```
data16 movddup xmm0,QWORD PTR [eax]
```
So they arrange the prefixes in a specific order, so that \x66 is interpreted before \xf2, and they still use the last prefix read, \xf2, to select the instruction movddup. Are they right to reorder the prefixes like this, or are they wrong?

Capstone on the other hand outputs

movlpd xmm0, qword ptr [eax]

So it doesn't reorder the prefixes at all and simply takes the last prefix read, \x66, to select the instruction movlpd, which looks more logical in this case than what GCC and objdump do.

How does the CPU actually interpret these streams?

Upvotes: 5

Answers (2)

Peter Cordes

Reputation: 365772

9B 9B D9 30

Capstone is correct, and objdump's fstenv is also mostly correct.

fstenv isn't a real machine instruction, it's a pseudo-instruction for fwait + fnstenv. Notice that machine code for fnstenv listed in the manual entry is D9 /6, while fstenv adds a 9B before that.

9B is not an instruction prefix, it's a separate 1-byte instruction called wait, aka fwait. On the original 8086 + 8087, this was necessary because the 8087 was a truly separate coprocessor. See the question "How did the 8086 interface with the 8087 FPU coprocessor?" and the comments under the top answer there; before the 286, the two chips weren't tightly coupled enough for the main CPU to know whether there were pending FPU exceptions.

I'm not sure of the details, but fnstsw on an 8086 / 186 could maybe read an old version of the status word that didn't have the latest flags set from a masked exception. Or maybe it only matters with unmasked exceptions, for getting the FP exception from a multiply or whatever before the fnst* instruction. According to Stephen Kitt's comments, 286 and newer "checks its TEST line before executing an NPX instruction", automatically FWAITing.

And of course CPUs with integrated FPUs have no trouble with precise FP exceptions, and synchronous behaviour, so fwait is a waste of space there.

Capstone's wait / wait / fnstenv dword ptr [eax] is thus more explicit, because as far as the CPU is concerned, it really is 3 instructions. (Andreas's answer shows that the performance counters on a modern x86 CPU record it that way.)

Objdump treats two preceding fwait instructions as part of a single fstenv. It would be more accurate to decode it as fwait ; fstenv dword ptr [eax] because Intel's manual only documents fstenv as including a single fwait opcode. But an extra fwait has no architectural effect.
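To make the two views concrete for a disassembler, here is a minimal Python sketch. It is not how Capstone or objdump are implemented; the function names `decode_x87_stream` and `fuse_single_wait` are made up for illustration, and the decoder only understands 9B and the simplest 32-bit memory form of D9 /6. The first function decodes 9B as a standalone instruction, and the second optionally folds exactly one preceding wait into the documented fstenv pseudo-instruction.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_x87_stream(code: bytes) -> list[str]:
    """Decode a tiny subset: 9B (wait) and D9 /6 with a plain [reg] operand."""
    out = []
    i = 0
    while i < len(code):
        b = code[i]
        if b == 0x9B:
            # 9B is a complete 1-byte instruction, not a prefix.
            out.append("wait")
            i += 1
        elif b == 0xD9 and i + 1 < len(code):
            modrm = code[i + 1]
            mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
            # Only mod=00 without SIB (rm=4) or disp32 (rm=5) is handled here.
            if reg == 6 and mod == 0 and rm not in (4, 5):
                out.append(f"fnstenv [{REG32[rm]}]")
                i += 2
            else:
                raise ValueError("encoding not covered by this sketch")
        else:
            raise ValueError("byte not covered by this sketch")
    return out


def fuse_single_wait(insns: list[str]) -> list[str]:
    """Fold one wait directly before fnstenv into the fstenv pseudo-instruction."""
    fused = []
    for ins in insns:
        if ins.startswith("fnstenv") and fused and fused[-1] == "wait":
            fused[-1] = "fstenv" + ins[len("fnstenv"):]
        else:
            fused.append(ins)
    return fused


raw = decode_x87_stream(b"\x9b\x9b\xd9\x30")
print(raw)                    # ['wait', 'wait', 'fnstenv [eax]']
print(fuse_single_wait(raw))  # ['wait', 'fstenv [eax]']
```

Keeping the fusion as a separate, optional pass means the raw decode always matches what the CPU executes, while the pretty-printed form never absorbs more than the single fwait that the manual documents.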

Part 2

As Andreas's answer shows, f2 66 0f 12 00 decodes as a movddup (64-bit broadcast) on real hardware, with a meaningless 66 (data16 operand-size) prefix. objdump is correct, at least for that CPU.

The documented encoding for movddup is F2 0F 12, where F2 is a mandatory prefix, and 0F is the escape byte.

We might have expected it to decode as 66 0F 12 /r MOVLPD with a meaningless F2 REP prefix, but that's not the case; Capstone is wrong. There are rules for mandatory prefix bytes (see the question "order for encoding x86 instruction prefix bytes"), including "the 66 prefix is ignored if either F2 or F3 are used".

I'm not 100% sure this sequence is guaranteed to decode as movddup on all hardware, or if this is merely how Intel Sandybridge-family happens to decode it. As @fuz commented, there is a required order for mandatory prefixes, and getting it wrong gives undefined behaviour (i.e. a specific CPU might decode it to anything, especially some future CPU where a different sequence of prefixes is mandatory for some other instruction).
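The "66 is ignored if F2 or F3 is present" rule can be sketched in Python as follows. This is an illustration of the observed behaviour, not a guaranteed architectural rule: the function name `select_0f12_mem` is hypothetical, it only covers the memory-operand forms of opcode 0F 12, it ignores every other prefix, and treating the last F2/F3 as the winner when both appear is an assumption of this sketch.

```python
# Memory-operand forms of opcode 0F 12, keyed by the mandatory prefix.
OPCODE_0F12_MEM = {
    None: "movlps",
    0x66: "movlpd",
    0xF3: "movsldup",
    0xF2: "movddup",
}


def select_0f12_mem(code: bytes) -> str:
    """Pick the mnemonic for [66|F2|F3]* 0F 12 with a memory operand."""
    seen_66 = False
    rep = None  # last F2/F3 seen (assumption: the last one wins)
    i = 0
    while i < len(code) and code[i] in (0x66, 0xF2, 0xF3):
        if code[i] == 0x66:
            seen_66 = True
        else:
            rep = code[i]
        i += 1
    if code[i:i + 2] != b"\x0f\x12":
        raise ValueError("not a 0F 12 opcode")
    # F2/F3 take priority over 66, regardless of the order they appeared in.
    if rep is not None:
        key = rep
    elif seen_66:
        key = 0x66
    else:
        key = None
    return OPCODE_0F12_MEM[key]


print(select_0f12_mem(bytes.fromhex("f2660f1200")))  # movddup
print(select_0f12_mem(bytes.fromhex("660f1200")))    # movlpd
```

A disassembler that simply takes the last prefix byte before 0F would return movlpd for the first stream, which is the disagreement between Capstone and objdump described in the question.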

Upvotes: 3

Andreas Abel

Reputation: 1468

How your CPU actually interprets these streams can be tested relatively easily.

For the first stream, you can use my tool nanoBench. You can use the command

sudo ./nanoBench.sh -asm_init "mov RAX, R14" -asm ".byte 0x9b, 0x9b, 0xd9, 0x30"

This command first sets RAX to a valid memory address, and then runs your stream multiple times. On my Core i7-8700K, I get the following output (for the fixed-function performance counters):

Instructions retired: 3.00
Core cycles: 73.00
Reference cycles: 62.70

We can see that the CPU executes three instructions, so Capstone seems to be correct.

You can analyze the second stream using the debug mode of nanoBench:

sudo ./nanoBench.sh -unroll 1 -asm "mov RAX, R14; mov qword ptr [RAX], 1234; .byte 0xf2, 0x66, 0x0f, 0x12, 0x00" -debug

This will - inside gdb - first execute the asm code, and then generate a breakpoint trap. We can now look at the current value of the XMM0 register:

(gdb) p $xmm0.v2_int64
$1 = {1234, 1234}

So the high and the low quadword of XMM0 now have the same value as the memory at address RAX, which indicates that the CPU executed the movddup instruction.
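Why this register value distinguishes the two candidate decodings can be shown with a small Python model of the two instructions' data movement. This is only a sketch: registers are modelled as (low, high) tuples of quadwords in the same order gdb prints `v2_int64`, and the starting value of XMM0 is an assumption (the test only tells the two apart if the high quadword was not already 1234).

```python
def movddup(xmm: tuple[int, int], mem64: int) -> tuple[int, int]:
    """MOVDDUP xmm, m64: load one quadword and duplicate it into both halves."""
    return (mem64, mem64)


def movlpd(xmm: tuple[int, int], mem64: int) -> tuple[int, int]:
    """MOVLPD xmm, m64: replace the low quadword, keep the high quadword."""
    return (mem64, xmm[1])


xmm0_before = (0, 0)  # assumption: XMM0 did not already hold 1234 in its high half
mem = 1234

print(movddup(xmm0_before, mem))  # (1234, 1234)
print(movlpd(xmm0_before, mem))   # (1234, 0)
```

Only the movddup model reproduces the {1234, 1234} seen in gdb; a movlpd would have left the high quadword at its previous value.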

You can also analyze the second stream without using nanoBench. To do this, you can save the following assembler code in a file asm.s.

.intel_syntax noprefix

.global _start
_start:
    mov RAX, RSP
    mov qword ptr [RAX], 1234   
    .byte 0xf2, 0x66, 0x0f, 0x12, 0x00
    int 0x03 /* breakpoint trap */

Then, you can build it using

as asm.s -o asm.o
ld -s asm.o -o asm

Now you can analyze it with gdb using gdb ./asm:

(gdb) r
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400088 in ?? ()
(gdb) p $xmm0.v2_int64
$2 = {1234, 1234}

Upvotes: 5
