Reputation: 769
Intuitively, a write to state stored on the CPU itself should be faster than the equivalent memory operation, because there is no chance of a cache miss. This state is held on-chip and does not affect any CPU behavior until the next VMLAUNCH/VMRESUME, so updating it should be cheaper than an equivalent write to a memory address.
This question arises when comparing the virtualization solutions provided by AMD and Intel. Intel mandates that all accesses to the VMCS data structure go through the VMREAD/VMWRITE interface rather than regular memory reads/writes. AMD imposes no such restriction: its VMCB region is read and modified with ordinary memory operations.
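To make the contrast concrete, here is a minimal sketch of what "update one guest-state field" looks like on each side. This is not code from any real hypervisor: the function names and the VMCB struct layout are made up for illustration, and the VMWRITE must run in VMX root operation with a current VMCS to succeed.

#include <stdint.h>

/* Intel VT-x: VMCS fields are opaque; the only architectural way to update one
 * is the VMWRITE instruction. AT&T operand order puts the value first and the
 * field encoding second (the reverse of Intel syntax). */
static inline int vmcs_write64(uint64_t field, uint64_t value)
{
    uint8_t fail;
    asm volatile("vmwrite %2, %1\n\t"
                 "setna %0"               /* CF or ZF set => VMfail */
                 : "=q"(fail)
                 : "r"(field), "rm"(value)
                 : "cc", "memory");
    return fail;                          /* non-zero on failure */
}

/* AMD SVM: the VMCB is an ordinary 4 KiB page in memory, so guest state is
 * updated with plain stores. This struct is a stand-in, not the real layout. */
struct vmcb_save_area_sketch {
    uint64_t rip;
    uint64_t rsp;
    /* ... many more fields ... */
};

static inline void vmcb_set_rip(volatile struct vmcb_save_area_sketch *save,
                                uint64_t new_rip)
{
    save->rip = new_rip;                  /* just a cached memory write */
}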
The gain from Intel's approach should be faster VMExit/VMResume times compared to AMD. The trade-off is flexibility, since Intel's interface requires adding new instructions.
However, in reality, VMREAD/VMWRITE operations are slower than regular memory operations. This does not make any sense to me.
Upvotes: 1
Views: 537
Reputation: 365537
Regular memory reads/writes are handled with dedicated hardware to optimize the hell out of them, because real programs are full of them.
Most workloads don't spend very much time on modifying special CPU control registers, so the internal handling of these instructions is often not heavily optimized. Internally, it may be microcoded (i.e. decodes to many uops from the microcode ROM).
Segment registers might not be a great analogy, because writing one triggers the CPU to load a descriptor from the GDT / LDT. But according to Agner Fog's testing on Nehalem, mov sr, r has one per 13 cycles throughput, and decodes to 6 uops (from microcode). (He stopped testing segment-register stuff for later CPUs.) Actually, I'm not sure whether he tested that in 16-bit or 32-bit mode. If it was 16-bit real mode, then writing a segment register doesn't read a descriptor at all; it just updates the segment base.
Reading a segment register is faster: one per clock. But that's still slower than reading a normal register (regular mov instructions have 0.33c throughput on Nehalem).
Nehalem could only load and/or store once per clock, unlike Sandybridge-family, which can do 2 loads per clock. But segment-register reads probably aren't any faster on those later CPUs.
Moves to/from control registers might be even slower, because they're even rarer than segment-register accesses.
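If you want a rough feel for this kind of gap yourself, here's a user-space sketch I'm adding (not from Agner Fog's test suite; control-register moves need ring 0, so it only compares mov from a segment register against a plain register-register mov). It reports TSC reference cycles per instruction, so treat the output as a ratio rather than exact core-clock counts.

/* Build: gcc -O2 segread.c -o segread  (x86-64, GNU C) */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 50000000ULL   /* each iteration runs 4 copies of the instruction */

static double time_seg_reads(void)
{
    unsigned long s;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        /* 4 back-to-back reads of %ss to amortize loop overhead */
        asm volatile("mov %%ss, %0\n\t"
                     "mov %%ss, %0\n\t"
                     "mov %%ss, %0\n\t"
                     "mov %%ss, %0" : "=r"(s));
    }
    return (double)(__rdtsc() - start) / (ITERS * 4);
}

static double time_reg_moves(void)
{
    unsigned long dst;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        /* 4 plain GPR-to-GPR moves; note that mov-elimination on newer CPUs
         * can make these nearly free at the rename stage */
        asm volatile("mov %1, %0\n\t"
                     "mov %1, %0\n\t"
                     "mov %1, %0\n\t"
                     "mov %1, %0" : "=r"(dst) : "r"(i));
    }
    return (double)(__rdtsc() - start) / (ITERS * 4);
}

int main(void)
{
    printf("mov from ss : %.2f ref-cycles each\n", time_seg_reads());
    printf("mov reg,reg : %.2f ref-cycles each\n", time_reg_moves());
    return 0;
}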
Upvotes: 2