Lupe
Lupe

Reputation: 349

How do (x86) virtual machines generally handle flags?

For a side project, I am attempting to write a semi-programmable x86 virtual machine.

I understand the formats so most of the design is relatively simple, but after executing an instruction with its operands, flags are often changed. It would be very inefficient to check each potential bit, so I was thinking of popping the flag register into the VM, ANDing it, and then setting the VM's flag register. However, this is still a lot of overhead.

This is borderline opinionated, but is there something I am missing?

Upvotes: 3

Views: 1077

Answers (2)

Peter Cordes
Peter Cordes

Reputation: 365747

You appear to be asking about emulating x86, not virtualizing it. Since modern x86 hardware supports virtualization, where the CPU runs guest code natively and only traps to the hypervisor for some privileged instructions, that's what the term "virtualization" normally means.


Lazy flag evalution is typical. Instead of actually calculating all the flags, just save the operands from the last instruction that set flags. Then if something actually reads the flags, figure out what the flag values need to be.

This means you don't actually have to calculate PF and AF every time they're written (almost every instruction), only every time they're read (mostly only PUSHF or interrupts, hardly any code ever reads PF (except for FP branches where it means NaN)). Computing PF after every integer instruction is expensive in pure C, since it requires a popcount on the low 8 bits of results. (And I think C compilers generally don't manage to recognize that pattern and use setp themselves, let alone a pushf or lahf to store multiple flags, if compiling an x86 emulator to run on an x86 host. They do sometimes recognize population-count patterns and emit popcnt instructions, though, when targetting host CPUs that have that feature (e.g. -march=nehalem)).

BOCHS uses this technique, and describes the implementation in in some detail in the Lazy Flags section of this short pdf: How Bochs Works Under the Hood 2nd edition. They save the result so they can derive ZF, SF, and PF, and the carry-out from the high 2 bits for CF and OF, and from bit 3 for AF. With this, they never need to replay an instruction to compute its flag results.

There are extra complications from some instructions not writing all the flags (i.e. partial-flag updates), and presumably from instructions like BSF that set ZF based on the input not the output.


Further reading:

This paper on emulators.com gives a lot of details on how to efficiently save enough state to reconstruct flags. It has a "2.1 Lazy Arithmetic Flags for CPU Emulation".

One of the authors is Darek Mihocka (long time emulator writer, now working at Intel apparently). He has written much interesting stuff about making non-JIT emulators run fast, and CPU performance stuff in general, much of it posted on his site, http://www.emulators.com/. E.g. this article about avoiding branch-misprediction in an emulator's interpreter loop that dispatches to functions that implement each opcode is quite interesting. Darek is also the co-author of that article about BOCHS internals I linked earlier.

A google hit for lazy flag eval may also be relevant: https://silviocesare.wordpress.com/2009/03/08/lazy-eflags-evaluation-and-other-emulator-optimisations/

Last time emulation of x86-like flags came up, the discussion in comments on my lazy-flags answer had some interesting stuff: e.g. @Raymond Chen suggested that link to the Mihocka & Troeger paper, and @amdn pointed out that JIT dynamic translation can produce faster emulation than interpretation.

Upvotes: 6

Alexis Wilke
Alexis Wilke

Reputation: 20818

If you want your emulator to simulate the processor as it is, then yes, you need to emulate the flags exactly.

This means clearing bits that need to be cleared (with AND), setting bits that need to be set (with OR), and copying / calculating bits when required (i.e. the Z flag requires testing whether the result is zero, the carry requires to know whether you have an overflow, etc.)

There is no way around it.

This is just like decoding the R/M mod byte. You have no way around having to load that byte, check the mode to determine whether this is a register or a memory access, and apply those accordingly...

And in effect, that means your emulator will be "much slower" (unless you are emulating an old 10Mhz processor with a 3Ghz modern processor, when you anyway have time to execute 300 cycles of instructions... so you should be just fine.)

If you're interested, I wrote a 6502 emulator and got it tested with an Apple 2 ROM. I had to add sleeps to not have it run at 100Mhz or more... (that processor was originally running 1Mhz...)

Upvotes: 6

Related Questions