Reputation: 1971
I'm currently developing a disassembler for x86 / x86-64. I have two questions about decoding instruction prefixes:
For the following stream:
\x9b\x9b\xd9\x30
GCC and objdump output
fstenv [eax]
So they first read all the prefixes (no more than 15) and then match the instruction using the last prefix read, \x9b, together with \xd9, to make it an fstenv instruction.
Capstone, on the other hand, outputs
wait
wait
fnstenv dword ptr [eax]
Now, obviously Capstone is wrong in that it emits two wait instructions and not just one. But should it emit wait instructions at all, or are GCC and objdump right to consume all the extra \x9b prefixes as part of the fstenv instruction?
For the following stream:
\xf2\x66\x0f\x12\x00
GCC and objdump output
data16 movddup xmm0,QWORD PTR [eax]
So they arrange the prefixes in a specific order, so that \x66 is interpreted before \xf2, and thus they still use the last prefix read, \xf2, to determine the instruction movddup. Are they right to arrange the prefixes this way, or are they wrong?
Capstone, on the other hand, outputs
movlpd xmm0, qword ptr [eax]
So it doesn't arrange the prefixes in any order and just takes the last prefix read, \x66, to determine the instruction movlpd, which looks more logical in this case than what GCC and objdump do.
How does the CPU actually interpret these streams?
Upvotes: 5
Views: 1640
Reputation: 365772
9B 9B D9 30
Capstone is correct, and objdump's fstenv is also mostly correct.
fstenv isn't a real machine instruction; it's a pseudo-instruction for fwait + fnstenv. Notice that the machine code for fnstenv listed in the manual entry is D9 /6, while fstenv adds a 9B before that.
9B is not an instruction prefix; it's a separate 1-byte instruction called wait, aka fwait. On the original 8086 + 8087, this was necessary because the 8087 was a truly separate coprocessor. See "How did the 8086 interface with the 8087 FPU coprocessor?" and the comments under the top answer there; before the 286, the two chips weren't tightly coupled enough for the main CPU to know if there were pending FPU exceptions.
I'm not sure of the details, but fnstsw on an 8086 / 186 could maybe read an old version of the status word that didn't have the latest flags set from a masked exception. Or maybe it only matters with unmasked exceptions, for getting the FP exception from a multiply or whatever before the fnst* instruction. According to Stephen Kitt's comments, 286 and newer "checks its TEST line before executing an NPX instruction", automatically FWAITing.
And of course CPUs with integrated FPUs have no trouble with precise FP exceptions and synchronous behaviour, so fwait is a waste of space there.
Capstone's wait / wait / fnstenv dword ptr [eax] is thus more explicit, because as far as the CPU is concerned, it really is 3 instructions. (As Andreas's answer shows, the performance counters on a modern x86 CPU record three retired instructions.)
Objdump treats two preceding fwait instructions as part of a single fstenv. It would be more accurate to decode it as fwait ; fstenv dword ptr [eax], because Intel's manual only documents fstenv as including a single fwait opcode. But an extra fwait has no architectural effect.
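The two decoding styles described above can be sketched in Python (the language is my choice; the question doesn't say what the disassembler is written in). This is a toy under stated assumptions, not how Capstone or objdump are implemented: the name decode_x87_stream and its fold_fwait flag are hypothetical, and the decoder only understands 9B and the simplest memory form of D9 /6.

```python
from typing import List


def decode_x87_stream(code: bytes, fold_fwait: bool = False) -> List[str]:
    """Toy decoder (hypothetical helper): knows only 9B and D9 /6.

    fold_fwait=False lists every 9B as its own `wait` instruction.
    fold_fwait=True folds exactly one preceding `wait` into the
    `fstenv` pseudo-instruction, matching the documented 9B D9 /6 form.
    """
    out: List[str] = []
    i = 0
    while i < len(code):
        if code[i] == 0x9B:
            # 9B is a complete one-byte instruction, not a prefix.
            out.append("wait")
            i += 1
        elif code[i] == 0xD9 and i + 1 < len(code):
            modrm = code[i + 1]
            mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
            # Assumption: only mod=00 without SIB/disp32 is handled here.
            if reg != 6 or mod != 0 or rm in (4, 5):
                raise ValueError("only the simple D9 /6 memory form is handled")
            if fold_fwait and out and out[-1] == "wait":
                out[-1] = "fstenv"   # wait + fnstenv shown as one mnemonic
            else:
                out.append("fnstenv")
            i += 2
        else:
            raise ValueError(f"unhandled byte at offset {i}")
    return out


print(decode_x87_stream(b"\x9b\x9b\xd9\x30"))
print(decode_x87_stream(b"\x9b\x9b\xd9\x30", fold_fwait=True))
```

With folding enabled, only the 9B immediately before D9 /6 is absorbed, so the extra 9B still shows up as its own wait.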
As Andreas's answer shows, f2 66 0f 12 00 decodes as a movddup (64-bit broadcast) on real hardware, with a meaningless 66 (data16 operand-size) prefix. objdump is correct, at least for that CPU.
The documented encoding for movddup is F2 0F 12, where F2 is a mandatory prefix and 0F is the escape byte.
We might have expected it to decode as 66 0F 12 /r MOVLPD with a meaningless F2 REPNE prefix, but that's not the case; Capstone is wrong. There are rules for mandatory prefix bytes (see "Order for encoding x86 instruction prefix bytes"), including "the 66 prefix is ignored if either F2 or F3 are used".
I'm not 100% sure this sequence is guaranteed to decode as movddup on all hardware, or if this is merely how Intel Sandybridge-family happens to decode it. As @fuz commented, there is a required order for mandatory prefixes, and getting it wrong gives undefined behaviour (i.e. a specific CPU might decode it to anything, especially some future CPU where a different sequence of prefixes is mandatory for some other instruction).
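As a rough illustration of the "66 is ignored if F2 or F3 is present" rule, here is a minimal Python sketch for the 0F 12 opcode with a memory operand. The names decode_0f12 and OPCODE_0F12_MEM are hypothetical, the table covers only this one opcode, and letting the last F2/F3 win when both appear is an assumption of the sketch, not something the answers above establish.

```python
# Mnemonics for 0F 12 with a memory operand, keyed by mandatory prefix.
OPCODE_0F12_MEM = {
    None: "movlps",
    0x66: "movlpd",
    0xF2: "movddup",
    0xF3: "movsldup",
}


def decode_0f12(code: bytes) -> str:
    """Pick the mnemonic for [66|F2|F3]* 0F 12 (hypothetical helper)."""
    seen_66 = False
    rep = None  # last F2/F3 seen (assumption: the last one wins)
    i = 0
    while i < len(code) and code[i] in (0x66, 0xF2, 0xF3):
        if code[i] == 0x66:
            seen_66 = True
        else:
            rep = code[i]
        i += 1
    if code[i:i + 2] != b"\x0f\x12":
        raise ValueError("this sketch only handles the 0F 12 opcode")
    # F2/F3 take priority over 66, regardless of byte order in the stream.
    if rep is not None:
        mandatory = rep
    elif seen_66:
        mandatory = 0x66
    else:
        mandatory = None
    mnemonic = OPCODE_0F12_MEM[mandatory]
    # A 66 that did not select the instruction is shown as a stray prefix.
    if rep is not None and seen_66:
        return "data16 " + mnemonic
    return mnemonic


print(decode_0f12(bytes.fromhex("f2660f1200")))
print(decode_0f12(bytes.fromhex("660f1200")))
```

The point of the design is that prefix order in the byte stream does not choose the instruction; the priority between prefix kinds does, which matches what objdump printed for this stream.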
Upvotes: 3
Reputation: 1468
How your CPU actually interprets these streams can be tested relatively easily.
For the first stream, you can use my tool nanoBench with the following command:
sudo ./nanoBench.sh -asm_init "mov RAX, R14" -asm ".byte 0x9b, 0x9b, 0xd9, 0x30"
This command first sets RAX to a valid memory address, and then runs your stream multiple times. On my Core i7-8700K, I get the following output (for the fixed-function performance counters):
Instructions retired: 3.00
Core cycles: 73.00
Reference cycles: 62.70
We can see that the CPU executes three instructions, so Capstone seems to be correct.
You can analyze the second stream using the debug mode of nanoBench:
sudo ./nanoBench.sh -unroll 1 -asm "mov RAX, R14; mov qword ptr [RAX], 1234; .byte 0xf2, 0x66, 0x0f, 0x12, 0x00" -debug
This will, inside gdb, first execute the asm code and then generate a breakpoint trap. We can now look at the current value of the XMM0 register:
(gdb) p $xmm0.v2_int64
$1 = {1234, 1234}
So the high and the low quadword of XMM0 now have the same value as the memory at address RAX, which indicates that the CPU executed the movddup instruction.
You can also analyze the second stream without using nanoBench. To do this, you can save the following assembler code in a file asm.s:
.intel_syntax noprefix
.global _start
_start:
mov RAX, RSP
mov qword ptr [RAX], 1234
.byte 0xf2, 0x66, 0x0f, 0x12, 0x00
int 0x03 /* breakpoint trap */
Then, you can build it using:
as asm.s -o asm.o
ld -s asm.o -o asm
Now you can analyze it with gdb using gdb ./asm:
(gdb) r
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400088 in ?? ()
(gdb) p $xmm0.v2_int64
$2 = {1234, 1234}
Upvotes: 5