SolaGratia

Reputation: 319

C++ For loop optimizations for a virtual machine

Context


My question is twofold (really two questions), but quite basic*. First, I will show some relevant code for context. For the TL;DR 'meat and potatoes', skip to the bottom for the actual questions.

*(I'm assuming answerers are aware of what is happening/how a virtual machine operates fundamentally before attempting to answer).

As mentioned, I am writing a (toy) VM, which executes a custom byte code instruction set.

(ellipses here only represent omission of some cases)

Here is a snippet of my code:

for (ip = 0; (ip < _PROGRAM_SIZE || !cstackempty); ip++) {
    if (breakPending) { break; }

    switch (_instr) {

        case INST::PUSH: {
            AssertAbort(wontoverflow(1), "Stack overflow (1 byte)");
            cmd_ "PUSH";
            push(_incbyte);
            printStack();
            break;
        }

        ...

        case INST::ADD: {
            AssertAbort(stackhas(2), "Can't pop stack to add 2 bytes. Stack does not contain 2 bytes");
            cmd_ "ADD";
            byte popped8_a = pop();
            byte popped8_b = pop();
            byte result = popped8_a + popped8_b;
            push(result);
            cmd_ " "; cmd_(byte)result;
            printStack();
            break;
        }

        case INST::ADD16: {
            AssertAbort(stackhas(4), "Can't pop stack to add 4 bytes. Stack does not contain 4 bytes");
            cmd_ "ADD16";
            u16 popped16_a = pop16();
            u16 popped16_b = pop16();
            u16 result = popped16_a + popped16_b;
            push16(result);
            cmd << " "; cmd << (u16)result;
            printStack();
            break;
        }

        ...
    }
}

Only because it's relevant, I will mention that _cstack is the call stack, hence the !cstackempty macro, which checks whether the call stack is empty before calling it quits (exiting the for loop) just because the last instruction is being executed (that last instruction could well be part of a function, or even a return). Also, ip (the instruction pointer) is simply an unsigned long long (u64), as is _PROGRAM_SIZE (the size of the program in bytes), and _instr is a byte referencing the current instruction (1 byte).


Meat and potatoes

Question 1: Since I'm initialising two new integers of variable size per block/case (the cases are segmented into blocks to avoid redeclaration errors and such), would declaring them above the for loop be helpful in terms of speed, assignment latency, program size, etc.?

Question 2: Would continue be faster than break in this case? Is there any faster, implementation-agnostic way to execute such a conditional loop, such as some kind of goto-pointer-to-label like in this post, or some way to avoid the cost of either continue or break altogether?

To summarize, my priorities are speed, then memory costs (speed, efficiency), then file size (of the VM).

Upvotes: 3

Views: 840

Answers (2)

BeeOnRope

Reputation: 65056

Before answering the specific questions, a note: There isn't any CPU that executes C++ directly. So any question of this type of micro-optimization at the language level depends heavily on the compiler, software runtime environment and target hardware. It is entirely possible that one technique works better on the compiler you are using today, but worse on the one you use tomorrow. Similarly for hardware choices such as CPU architecture.

The only way to get a definitive answer of which is better is to benchmark it in a realistic situation, and often the only way to understand the benchmark results is to dive into the generated assembly. If this kind of optimization is important to you, consider learning a bit about the assembly language for your development architecture.

Given that, I'll pick a specific compiler (gcc) and a common architecture (x86) and answer in that context. The details will differ slightly for other choices, but I expect the broad strokes to be similar for any decent compiler and hardware combination.

Question 1

The place of declaration doesn't matter. The declaration itself doesn't really turn into code - only the definition and use generate code.

For example, consider the two variants of a simple loop below (the external sink() function is just there to avoid optimizing away the assignment to a):

Declaration Inside Loop

void sink(int); // defined elsewhere

void func(int* num) {
  for (unsigned int i=0; i<100; i++) {
    int a = *num + *num;
    sink(a);
    sink(a);
  }
}

Declaration Outside Loop

void sink(int); // defined elsewhere

void func(int* num) {
  int a;
  for (unsigned int i=0; i<100; i++) {
    a = *num + *num;
    sink(a);
    sink(a);
  }
}

We can use the godbolt compiler explorer to easily check the assembly generated for the first and second variants. They are identical - here's the loop:

.L2:
        mov     ebp, DWORD PTR [r12]
        add     ebx, 1
        add     ebp, ebp
        mov     edi, ebp
        call    sink(int)
        mov     edi, ebp
        call    sink(int)
        cmp     ebx, 100
        jne     .L2

Basically the declaration doesn't produce any code - only the assignment does.

Question 2

Here it is key to note that at the hardware level there aren't instructions like "break" or "continue". You really only have jumps, conditional or not, which are basically gotos. Both break and continue are translated into jumps. In your case, a break inside a switch, where the break is the last statement in the loop, and a continue inside the switch have exactly the same effect, so I'd expect them to be compiled identically - but let's check.

Let's use this test case:

void sinka(); void sinkb(); void sinkc(); void sinkd(); // defined elsewhere

void func(unsigned int num, int iters) {
  for (; iters > 0; iters--) {
    switch (num) {
      case 0:
        sinka();
        break;
      case 1:
        sinkb();
        break;
      case 2:
        sinkc();
        break;
      case 3:
        sinkd();
        break;
      case 4:
        sinkd();
        break;
    }
  }
}

It uses break to exit each case. Here's the godbolt output on gcc 4.4.7 for x86, ignoring the function prologue:

.L13:
        cmp     ebp, 4
        ja      .L3
        jmp     [QWORD PTR [r13+r12*8]] # indirect jump
.L9:
        .quad   .L4
        .quad   .L5
        .quad   .L6
        .quad   .L7
        .quad   .L8
.L4:
        call    sinka()
        jmp     .L3
.L5:
        call    sinkb()
        jmp     .L3
.L6:
        call    sinkc()
        jmp     .L3
.L7:
        call    sinkd()
        jmp     .L3
.L8:
        call    sinkd()
.L3:
        sub     ebx, 1
        test    ebx, ebx
        jg      .L13

Here, the compiler has chosen a jump table approach. The value of num is used to look up a jump address (the table is the series of .quad directives), and then an indirect jump is made to one of the labels L4 through L8. The breaks turn into jmp .L3, which continues with the loop logic.

Note that a jump table isn't the only way to compile a switch - if I used 4 or fewer case statements, the compiler instead chose a series of branches.

Let's try the same example, but with each break replaced with a continue:

int func(unsigned int num, int iters) {
  for (; iters > 0; iters--) {
    switch (num) {
      case 0:
        sinka();
        continue;
... [16 lines omitted] ...
    }
  }
}

As you might have guessed by now, the results are identical - at least for this particular compiler and target. The continue statements and break statements imply the exact same control flow, so I'd expect this to be true for most decent compilers with optimization turned on.

Upvotes: 3

Careful Now

Reputation: 272

For Question 2, a processor should handle break reasonably well, since it compiles to a branch that is always taken, so it shouldn't cause much trouble. That means there is no pipeline flush for the reason stated, as the branch prediction unit will get this one right. I believe Question 1 was answered above.

Upvotes: 1
