user2341104

VM interpreter - weighing performance benefits and drawbacks of larger instruction set / dispatch loop

I am developing a simple VM and I am at a crossroads.

My initial goal was to use byte-long instructions, and therefore a small dispatch loop and a quick computed-goto dispatch.
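
To illustrate, the kind of dispatch I had in mind was roughly this (a minimal sketch using GCC's labels-as-values, with made-up handler names rather than my real ones):

#include <stdint.h>

// Minimal computed-goto dispatch: one byte of opcode, and every handler
// jumps straight to the next handler instead of returning to a central loop.
void run(const uint8_t *code)
{
    static void *dispatch[256] = { &&op_halt, &&op_assign_imm8 /* , ... */ };
    const uint8_t *ip = code;

    goto *dispatch[*ip];                // initial dispatch

op_assign_imm8:
    // decode the operands and execute here, then dispatch the next opcode
    ip += 4;
    goto *dispatch[*ip];

op_halt:
    return;
}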

However, it turns out reality could not be further from that - 256 opcodes are nowhere near enough to cover signed and unsigned 8-, 16-, 32- and 64-bit integers, floats and doubles, pointer operations, and the different combinations of addressing. One option was to not implement bytes and shorts, but the goal is to make a VM that supports the full C subset as well as vector operations, since they are pretty much everywhere anyway, albeit in different implementations.

So I switched to 16-bit instructions, which also lets me add portable SIMD intrinsics and more precompiled common routines that really save on performance by not being interpreted. There is also caching of global addresses: they are initially compiled as base-pointer offsets, and the first time an address is resolved it simply overwrites the offset and the instruction, so that next time it is a direct jump, at the cost of an extra instruction in the set for each use of a global by an instruction.
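
To illustrate the caching idea, a rough sketch of the "first use" handler (invented names, an assumed 8-byte encoding, and a 32-bit build as in the assembly further down):

#include <stdint.h>

extern uint8_t *ip;      // current instruction pointer
extern uint8_t *sp;      // base of the register stack
extern uint8_t *gp;      // base of the global data segment

enum { OP_ASSIGN_I8U_REG_GLOB = 0x0011 };    // the "already resolved" opcode

struct AssignGlob {      // assumed layout: opcode, register offset, address
    uint16_t opcode;
    uint16_t reg;
    uint32_t addr;       // gp-relative offset on first use, absolute afterwards
};

void assign_i8u_reg_globCache(void)
{
    struct AssignGlob *i = (struct AssignGlob *)ip;
    i->addr  += (uint32_t)(uintptr_t)gp;     // resolve: offset -> absolute address
    i->opcode = OP_ASSIGN_I8U_REG_GLOB;      // patch: next time take the direct handler
    sp[i->reg] = *(const uint8_t *)(uintptr_t)i->addr;
    ip += sizeof(struct AssignGlob);
}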

Since I am not at the profiling stage yet, I am in a dilemma: are the extra instructions worth the added flexibility? Will the presence of more instructions, and therefore the absence of instructions that copy values back and forth, make up for the increased size of the dispatch loop? Keep in mind that the instructions are just a few assembly instructions each, e.g.:

    .globl  __Z20assign_i8u_reg8_imm8v
    .def    __Z20assign_i8u_reg8_imm8v; .scl    2;  .type   32; .endef
__Z20assign_i8u_reg8_imm8v:
LFB13:
    .cfi_startproc
    movl    _ip, %eax
    movb    3(%eax), %cl
    movzbl  2(%eax), %eax
    movl    _sp, %edx
    movb    %cl, (%edx,%eax)
    addl    $4, _ip
    ret
    .cfi_endproc
LFE13:
    .p2align 2,,3
    .globl  __Z18assign_i8u_reg_regv
    .def    __Z18assign_i8u_reg_regv;   .scl    2;  .type   32; .endef
__Z18assign_i8u_reg_regv:
LFB14:
    .cfi_startproc
    movl    _ip, %edx
    movl    _sp, %eax
    movzbl  3(%edx), %ecx
    movb    (%ecx,%eax), %cl
    movzbl  2(%edx), %edx
    movb    %cl, (%eax,%edx)
    addl    $4, _ip
    ret
    .cfi_endproc
LFE14:
    .p2align 2,,3
    .globl  __Z24assign_i8u_reg_globCachev
    .def    __Z24assign_i8u_reg_globCachev; .scl    2;  .type   32; .endef
__Z24assign_i8u_reg_globCachev:
LFB15:
    .cfi_startproc
    movl    _ip, %eax
    movl    _sp, %edx
    movl    4(%eax), %ecx
    addl    %edx, %ecx
    movl    %ecx, 4(%eax)
    movb    (%ecx), %cl
    movzwl  2(%eax), %eax
    movb    %cl, (%eax,%edx)
    addl    $8, _ip
    ret
    .cfi_endproc
LFE15:
    .p2align 2,,3
    .globl  __Z19assign_i8u_reg_globv
    .def    __Z19assign_i8u_reg_globv;  .scl    2;  .type   32; .endef
__Z19assign_i8u_reg_globv:
LFB16:
    .cfi_startproc
    movl    _ip, %eax
    movl    4(%eax), %edx
    movb    (%edx), %cl
    movzwl  2(%eax), %eax
    movl    _sp, %edx
    movb    %cl, (%edx,%eax)
    addl    $8, _ip
    ret
    .cfi_endproc

This example contains the instructions to assign an unsigned byte to a register from an immediate value, from another register, from a global (resolving and caching its address on first use), and from a global through an already cached address.

Naturally, when I produce a compiler for it, I will be able to test the instruction flow in production code and optimize the arrangement of the handlers in memory to pack the frequently used ones together and get more cache hits.

I just have a hard time figuring out whether such a strategy is a good idea; the bloat buys flexibility, but what about performance? Will more compiled routines make up for a larger dispatch loop? Is it worth caching global addresses?

I would also like someone decent at assembly to give an opinion on the quality of the code generated by GCC - are there any obvious inefficiencies or room for optimization? To make the situation clear: sp points to the stack that implements the registers (there is no other stack), ip is logically the current instruction pointer, and gp is the global pointer (not referenced directly, accessed as an offset).

EDIT: Also, this is the basic format I am implementing the instructions in:

INSTRUCTION assign_i8u_reg16_glob() { // assign unsigned byte to reg from global offset
    FETCH(globallAddressCache);
    REG(quint8, i.d16_1) = GLOB(quint8);
    INC(globallAddressCache);
}

FETCH returns a reference to the operand struct that the instruction uses, based on the opcode.

REG returns a reference to a register value of type T at the given offset.

GLOB returns a reference to a global value through a cached global offset (effectively an absolute address).

INC just increments the instruction pointer by the size of the instruction.

Some people will probably advise against the use of macros, but with templates this is much less readable. This way the code is pretty obvious.
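
Roughly, the macros boil down to something like this (a simplified sketch, not my exact definitions; 32-bit build assumed):

#include <QtGlobal>        // quint8, quint16, quint32

extern char *ip;           // current instruction pointer
extern char *sp;           // register stack base

struct globallAddressCache {     // operand layout for the global-access ops
    quint16 opcode;
    quint16 d16_1;               // destination register offset
    quint32 addr;                // cached absolute address of the global
};

#define INSTRUCTION  void
#define FETCH(T)     T &i = *reinterpret_cast<T *>(ip)
#define REG(T, off)  (*reinterpret_cast<T *>(sp + (off)))
#define GLOB(T)      (*reinterpret_cast<T *>(i.addr))
#define INC(T)       (ip += sizeof(T))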

EDIT: I would like to add a few points to the question:

To me, as I currently weigh the arguments, the "many instructions" approach appears to have the best chance of top performance, provided of course that my theory about fitting the "extended dispatch loop" in the L1 cache holds. So here is where your expertise and experience come into play. Now that the context is narrowed and a few supporting ideas have been presented, maybe it will be easier to give a more concrete answer on whether the benefits of a larger instruction set (less slow, interpreted code) prevail over the increase in native code size.
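
As a rough sanity check of that theory (assuming a typical 32 KB L1 instruction cache and handlers about the size of the ones above, say 25-40 bytes of machine code each): 500 handlers come to roughly 12-20 KB and 1000 to roughly 25-40 KB, so a few hundred instructions should fit with room to spare, while going much past a thousand starts to press against the limit even before counting the dispatch code itself.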

My instruction size data is based on those stats.

Upvotes: 9

Views: 662

Answers (5)

supercat

Reputation: 81347

One thing you should decide is what balance you wish to strike between code-file size efficiency, cache efficiency, and raw execution-speed efficiency. Depending upon the coding patterns of the code you're interpreting, it may be helpful to have each instruction, regardless of its length in the code file, get translated into a structure containing a pointer and an integer. The pointer would point to a function that takes a pointer to the instruction-info structure as well as a pointer to the execution context. The main execution loop would thus be something like:

do
{
  pc = pc->func(pc, &context);
} while(pc);

The function associated with an "add short immediate" instruction would be something like:

INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
  context->op_stack[0].asInt64 += pc->operand;
  return pc+1;
}

while "add long immediate" would be: INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context) { context->op_stack[0] += (uint32_t)pc->operand + ((int64_t)(pc[1].operand) << 32); return pc+2; }

and the function associated with an "add local" instruction would be:

INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
  CONTEXT_ITEM *op_stack = context->op_stack;
  op_stack[0].asInt64 += op_stack[pc->operand].asInt64;
  return pc+1;
}

Your "executables" would consist of compressed bytecode format, but they would then get translated into a table of instructions, eliminating a level of indirection when decoding the instructions at run-time.

Upvotes: 0

Radnyx

Reputation: 324

Instruction length in bytes has been handled the same way for quite a while. Obviously, being limited to 256 instructions isn't a good thing when there are so many types of operations you wish to perform.

This is why there is a prefix value. Back in the Game Boy architecture, there wasn't enough room to include the needed 256 bit-control instructions, so one opcode was used as a prefix. This kept the original 256 opcodes and added 256 more for instructions starting with that prefix byte.

For example: One operation might look like this: D6 FF = SUB A, 0xFF

But a prefixed instruction would be presented as: CB D6 = SET 2, (HL)

If the processor read CB it'd immediately start looking in another instruction set of 256 opcodes.

The same goes for the x86 architecture today, where any instruction prefixed with 0F is essentially part of another instruction set.

With the sort of execution you're using for your emulator, this is the best way of extending your instruction set. 16-bit opcodes would take up far more space than necessary, and a prefix byte doesn't add much of a lookup cost.
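
In an interpreter structured like yours, that amounts to little more than a second dispatch table (a sketch with invented names; the tables would be filled in elsewhere):

typedef void (*handler_t)(void);

static handler_t primary[256];       /* base opcodes                        */
static handler_t extended[256];      /* opcodes that live behind the prefix */

extern unsigned char *ip;
extern int running;

static void op_prefix(void)          /* installed at, say, primary[0xCB] */
{
    ++ip;                            /* consume the prefix byte          */
    extended[*ip]();                 /* second 256-entry table lookup    */
}

static void interpret(void)
{
    while (running)
        primary[*ip]();              /* each handler advances ip itself  */
}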

Upvotes: 0

dtech

Reputation: 49329

I think you are asking the wrong question, and not because it is a bad question; on the contrary, it is an interesting subject and I suspect many people are interested in the results, just as I am.

However, so far no one is sharing similar experience, so I guess you may have to do some pioneering. Instead of wondering which approach to use and wasting time on the implementation of boilerplate code, focus on creating a “reflection” component that describes the structure and properties of the language. Create a nice polymorphic structure with virtual methods, without worrying about performance, and modular components you can assemble at runtime; there is even the option of using a declarative language once you have established the object hierarchy. Since you appear to use Qt, you have half the work cut out for you. Then you can use the tree structure to analyze and generate a variety of different code: C code to compile, or bytecode for a specific VM implementation, of which you can create several; you can even use it to programmatically generate the C code for your VM instead of typing it all by hand.
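
For example, the nodes of that tree could look something like this (just a sketch of the shape I mean, with made-up names):

#include <QString>
#include <QList>
#include <QByteArray>

// One node of the "reflection" tree; concrete subclasses decide how the
// instruction is rendered for a particular target.
struct InstructionDesc
{
    QString name;                            // e.g. "assign_i8u_reg_imm8"
    QList<QString> operands;                 // e.g. { "reg8", "imm8" }

    virtual ~InstructionDesc() {}
    virtual QString emitHandlerSource() const = 0;   // generate C/C++ handler code
    virtual QByteArray encode() const = 0;           // or emit the bytecode form
};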

I think this advice will be more beneficial if you do end up pioneering the subject without a concrete answer in advance; it will allow you to easily test all the scenarios and make up your mind based on actual performance rather than on your own assumptions and those of others. Then maybe you can share the results and answer your own question with performance data.

Upvotes: 1

MSalters

Reputation: 180303

You might want to consider separating the VM ISA from its implementation.

For instance, in a VM I wrote I had a "load value direct" instruction. The next value in the instruction stream wasn't decoded as an instruction, but loaded as a value into a register. You can consider this one macro instruction or two separate values.

Another instruction I implemented was "load constant value", which loaded a constant from memory (using a base address for the table of constants and an offset). A common pattern in the instruction stream was therefore load value direct (index); load constant value. Your VM implementation may recognize this pattern and handle the pair with a single optimized implementation.
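
That recognition can be done when the bytecode is loaded/translated, along these lines (a sketch with invented opcode names, assuming 2-byte operations):

#include <stddef.h>
#include <stdint.h>

enum { OP_LOAD_VALUE_DIRECT = 1, OP_LOAD_CONSTANT = 2,
       OP_LOAD_CONSTANT_AT  = 3 /* the fused super-instruction */ };

void emit(uint8_t op, uint8_t arg);   /* provided by the translator */

/* Walk 2-byte (opcode, operand) pairs and fuse the common
   "load value direct; load constant value" sequence into one operation. */
void translate(const uint8_t *code, size_t n)
{
    for (size_t i = 0; i + 1 < n; ) {
        if (i + 3 < n && code[i] == OP_LOAD_VALUE_DIRECT
                      && code[i + 2] == OP_LOAD_CONSTANT) {
            emit(OP_LOAD_CONSTANT_AT, code[i + 1]);   /* index from the first op */
            i += 4;
        } else {
            emit(code[i], code[i + 1]);
            i += 2;
        }
    }
}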

Obviously, if you have enough bits, you can use some of them to identify a register. With 8 bits it may be necessary to have a single register for all operations. But again, you could add another instruction, "with register X", which modifies the next operation. In your C++ code, that instruction would merely set the currentRegister pointer which the other instructions use.
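
Sketched out (with invented names and 2-byte operations assumed), that modifier might look like:

extern unsigned char *ip;

static long registers[16];
static long *currentRegister = &registers[0];   /* the default register */

static void op_with_register(void)   /* the "with register X" modifier */
{
    currentRegister = &registers[ip[1] & 15];   /* select for what follows */
    ip += 2;
}

static void op_add_imm8(void)        /* an ordinary op using the selection */
{
    *currentRegister += (signed char)ip[1];
    ip += 2;
}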

Upvotes: 5

Mats Petersson

Reputation: 129524

Will more compiled routines make up for a larger dispatch loop?

I take it you didn't fancy having single-byte instructions with a second byte of extra opcode for certain instructions? I think decoding 16-bit opcodes may be less efficient than 8-bit plus extra byte(s), assuming the extra byte(s) aren't too common or too difficult to decode in themselves.

If it were me, I'd work on getting the compiler (not necessarily a full-fledged compiler with "everything", but a basic model) going with a fairly limited set of "instructions". Keep the code generation part fairly flexible so that it will be easy to alter the actual encoding later. Once you have that working, you can experiment with various encodings and see what the results are in terms of performance and other aspects.
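
For example, the code generator could talk to a small interface that hides the encoding (a sketch, not a recommendation of any particular layout):

#include <vector>
#include <cstdint>

// The compiler emits through this interface only, so the actual instruction
// encoding can be swapped without touching the code generation logic.
struct Encoder
{
    virtual ~Encoder() {}
    virtual void emit(int opcode, int a, int b) = 0;
};

// One concrete encoding: a 16-bit opcode followed by two 16-bit operands.
struct Encode16 : Encoder
{
    std::vector<std::uint16_t> out;
    void emit(int opcode, int a, int b) override
    {
        out.push_back(static_cast<std::uint16_t>(opcode));
        out.push_back(static_cast<std::uint16_t>(a));
        out.push_back(static_cast<std::uint16_t>(b));
    }
};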

A lot of your minor question points are very hard to answer for anyone who hasn't tried both of the choices. I have never written a VM in this sense, but I have worked on several disassemblers, instruction set simulators and similar things, and I have implemented a couple of interpreted languages of different kinds.

You probably also want to consider a JIT approach, where instead of interpreting the bytecode you translate it, at load time, into direct machine code for the architecture in question.

The GCC code doesn't look terrible, but there are several places where code depends on the value of the immediately preceding instruction - which is not great in modern processors. Unfortunately, I don't see any solution to that - it's a "too short code to shuffle things around" problem - adding more instructions obviously won't work.

I do see one little problem: loading a 32-bit constant will require that it's 32-bit aligned for best performance. I have no idea how (or if) Java VMs deal with that.

Upvotes: 3
