SolaGratia
SolaGratia

Reputation: 319

Which is the more efficient dispatch method for a virtual machine?

Which would be a more efficient dispatch method for making my fetch-decode-execute times that little bit quicker?

For simplicity, I've kept this to a minimum, like operations operate on 1 byte operands and there are only two, for example.

The method I'm using at the moment (simplified) is:

typedef unsigned char byte;

vector<byte> _program = { INST::PUSH, 32, INST::POP};

enum INST {
    PUSH =0, /*index=0*/
    POP =1, /*index=1*/
}

//DISPATCHING METHOD #1
switch (curr_instruction) {
    case INST::PUSH: {
        /*declared inline*/ _push_to_stack(_program[instr_ptr+1]);
    }
    case INST::POP: {
        /*declared inline*/ _pop_stack();
    }
}

OR using a function pointer table to execute each instruction in the 'program' (vector of bytes/ vector _program), like so:

typedef void (*voidptr)();

void hndl_push(){
    /*declared inline*/ _push_to_stack(_program[instr_ptr+1]);
}
void hndl_push(){
    /*declared inline*/ _pop_stack();
}

funcptr handlers[2] = {&hndl_push /*index=0*/, & hdnl_pop /*index=1*/}'
vector<byte> _program = { INST::PUSH, 32, INST::POP};

size_t instr_ptr=0;

//DISPATCHING METHOD #2
while (instr_ptr != _program.size()){
    instr_ptr++;
    _handlers[instr_ptr]();
}

I am using the VC++ (Visual Studio) compiler, the 2015 version.

Which of these is converted into more efficient assembler with the least overhead, or are they the same?

Thank you in advance!

Upvotes: 2

Views: 1016

Answers (3)

Peter Cordes
Peter Cordes

Reputation: 366094

Indirect branch prediction is hard, but at least there's only the one unconditional branch. The branch predictor needs to correctly predict the branch target address to be able to keep the pipeline fed.

However, an unpredictable conditional branch is bad, too. With enough cases, a single indirect branch will do better than multiple conditional branches, so that's what compilers do under the hood. With just two cases, you will almost certainly get better results from letting the compiler choose how to implement the switch.

Conditional branch predictors in some CPUs might be better at recognizing simple (but non-trivial) patterns than indirect branch predictors, but Intel SnB-family CPUs at least can recognize some target-address patterns for indirect branches. (Agner Fog's microarch pdf has a bit of info on branch predictors.)


See Lookup table vs switch in C embedded software, including my answer on that question where I posted some x86 compiler output.

Note the fact that clang will use a jump table to branch to a call instruction, rather than putting the function pointers themselves into an array. So when a jump table is the right choice, implementing it yourself with an array of function pointers will make better code than current clang.

Your code is exactly the same case: dispatching to a lot of functions that all have the same prototype and (lack of) args, so they can go into an array of function pointers.

Upvotes: 0

Adrian McCarthy
Adrian McCarthy

Reputation: 48038

The only way to know which would be faster is to measure.

The optimizer may be able to do quite a bit with either technique. Dense switch statements are often reduced to a jump table, and the function calls may be inlined, so that could be the fastest approach.

But if, for whatever reason, the optimizer cannot inline or if the switch statement becomes a cascaded of if-else statements, then the function pointer calls may be faster.

Wikipedia has a decent article on threaded code, which describes various techniques to dispatch opcodes in a virtual machine or interpreter.

Upvotes: 5

Gerry Candy
Gerry Candy

Reputation: 36

How could the second solution possibly be quicker than the first, but best the compiler could convert the second into the first anyway.

As a side note you need to change the program pointer dependant on the opcode.

Upvotes: 0

Related Questions