Lawrence Kesteloot

Reputation: 4358

Why is Node 10x slower than Chrome?

I'm running my Z80 emulator both in Chrome and in Node. I get about 10x the performance in Chrome that I do in Node. (100k Z80 instructions take 6 ms in Chrome and 60 ms in Node.) I've run the profiler:

% node --prof index.js
% node --prof-process isolate-0x108000000-25550-v8.log

and it says that 95% of the time is spent in C++:

[Summary]:
  ticks  total  nonlib   name
   103    3.8%    3.8%  JavaScript
  2604   95.2%   95.8%  C++
     6    0.2%    0.2%  GC
    17    0.6%          Shared libraries
    12    0.4%          Unaccounted

The C++ breakdown is:

[C++ entry points]:
  ticks    cpp   total   name
  2127   98.3%   77.7%  T __ZN2v88internal40Builtin_CallSitePrototypeGetPromiseIndexEiPmPNS0_7IsolateE
    32    1.5%    1.2%  T __ZN2v88internal21Builtin_HandleApiCallEiPmPNS0_7IsolateE

I've tracked down CallSitePrototypeGetPromiseIndex to this source file. I'm not using promises, async, or await in my code. My test is just a tight loop of 100k emulated Z80 instructions, no I/O or anything.
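
For concreteness, the benchmark is shaped roughly like this (a simplified, self-contained sketch, not my actual emulator code; the decode/execute step is elided):

// Simplified sketch of the benchmark shape: a tight loop that fetches and
// executes 100k instructions out of memory, all of which are 0x00 (NOP).
const memory = new Uint8Array(65536);   // all zeros, i.e. all NOPs
let pc = 0;

function step() {
    const opcode = memory[pc];
    pc = (pc + 1) & 0xFFFF;
    // ...decode and execute the opcode here...
}

const start = Date.now();
for (let i = 0; i < 100000; i++) {
    step();
}
console.log("100k instructions:", Date.now() - start, "ms");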

I've found others online using the --prof flag, and none of them see this in their results. Is it a side-effect of profiling? Am I somehow triggering promises inside the loop? Is there any reason Node should be this much slower than Chrome?

Details: Node v12.13.1, Chrome 79.0.3945.88.

Upvotes: 2

Views: 542

Answers (1)

Lawrence Kesteloot

Reputation: 4358

Okay, this surprisingly similar question had a great answer by Esailija pointing me to this line in the V8 source code. It limits optimization of switch statements to those under a certain size. The first thing my emulator does is dispatch the opcode through a 256-entry switch. In my test I'm only feeding it 0 (NOP), so it was safe to comment out huge chunks of the cases. It turns out that if I comment out 13 of the cases, performance jumps by a factor of 25! If I only comment out 12, I get the slow performance.
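
To show the shape of the dispatch that triggers this (a cut-down sketch with made-up case bodies, not my real opcode handlers):

// Cut-down sketch of the opcode dispatch. With all ~256 cases present the
// switch stays slow; commenting out enough cases (13 in my test) makes it fast.
function execute(cpu, opcode) {
    switch (opcode) {
        case 0x00: cpu.cycles += 4; break;                               // NOP
        case 0x3C: cpu.a = (cpu.a + 1) & 0xFF; cpu.cycles += 4; break;   // INC A (illustrative)
        case 0x3D: cpu.a = (cpu.a - 1) & 0xFF; cpu.cycles += 4; break;   // DEC A (illustrative)
        // ... roughly 253 more cases, one per opcode ...
        default: break;                                                  // unhandled in this sketch
    }
}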

The link into the V8 source code above is pretty old (2013), so I tried to find the modern equivalent. I didn't find a hard limit, but I did find several heuristics that decide between table lookups and tree (binary search) lookups (ia32, x86). When I plug in my numbers, they don't land right at a boundary where I saw the change, so I'm not sure this is the actual cause or whether there's some other optimization elsewhere that isn't being triggered.

As for the difference from Chrome, Node 12 and Chrome 79 bundle different V8 versions, so there's probably some subtle difference in when and how they decide to optimize switches.

I'm not sure what the best solution is here, but clearly I need to avoid large switch statements. I'll either have a sequence of smaller switch statements, or replace the whole thing with an array of functions.

Update: I used an array of functions and my entire program sped up by a factor of 25.
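
For anyone curious, the replacement looks roughly like this (a simplified sketch; the handlers here are placeholders, not my real instruction implementations):

// Simplified sketch of the array-of-functions dispatch that replaced the big
// switch. Each opcode indexes directly into a 256-entry table of handlers.
const handlers = new Array(256).fill((cpu) => { cpu.cycles += 4; });        // placeholder default

handlers[0x00] = (cpu) => { cpu.cycles += 4; };                             // NOP
handlers[0x3C] = (cpu) => { cpu.a = (cpu.a + 1) & 0xFF; cpu.cycles += 4; }; // INC A (illustrative)
handlers[0x3D] = (cpu) => { cpu.a = (cpu.a - 1) & 0xFF; cpu.cycles += 4; }; // DEC A (illustrative)

function execute(cpu, opcode) {
    handlers[opcode](cpu);   // table lookup instead of a 256-case switch
}

Unlike the big switch, the size of the table doesn't seem to affect whether V8 optimizes the dispatch.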

Upvotes: 4
