Reputation: 54026
First off, I have read the answer to Why is my WebAssembly function slower than the JavaScript equivalent?
But it has shed little light on the problem, and I have invested a lot of time that may well be that yellow stuff against the wall.
I do not use globals, I do not use any memory. I have two simple functions that find the length of a line segment, and I compare them to the same thing in plain old JavaScript. Each takes 4 parameters, uses 3 more locals, and returns a float or double.
On Chrome the JavaScript is 40 times faster than the WebAssembly, and on Firefox the wasm is almost 300 times slower than the JavaScript.
I have added a test case to jsPerf: WebAssembly V Javascript math
Either
1. I am doing something very wrong, or
2. WebAssembly is far from a ready technology.

Please, please be option 1.
I have read the WebAssembly use cases:
Re-use existing code by targeting WebAssembly, embedded in a larger JavaScript / HTML application. This could be anything from simple helper libraries, to compute-oriented task offload.
I was hoping I could replace some geometry libs with WebAssembly to get some extra performance. I was hoping that it would be awesome, like 10 or more times faster. BUT it is 300 times slower. WTF.
This is not a JS optimisation issue.
To ensure that optimisation has as little effect as possible, I have tested using the following methods to reduce or eliminate any optimisation bias:

- c += length(... to ensure all code is executed.
- bigCount += c to ensure the whole function is executed. (Not needed.)
- Math.hypot to prove code is being run.

// setup and associated functions
const setOf = (count, callback) => {var a = [],i = 0; while (i < count) { a.push(callback(i ++)) } return a };
const rand = (min = 1, max = min + (min = 0)) => Math.random() * (max - min) + min;
const a = setOf(100009,i=>rand(-100000,100000));
var bigCount = 0;
function len(x,y,x1,y1){
var nx = x1 - x;
var ny = y1 - y;
return Math.sqrt(nx * nx + ny * ny);
}
function lenSlow(x,y,x1,y1){
var nx = x1 - x;
var ny = y1 - y;
return Math.hypot(nx,ny);
}
function lenEmpty(x,y,x1,y1){
return x;
}
// Test functions in same scope as above. None is in global scope
// Each function is copied 4 times and tests are performed randomly.
// c += length(... to ensure all code is executed.
// bigCount += c to ensure whole function is executed.
// 4 lines for each function to reduce any inlining skew
// all values are randomly generated doubles
// each function call returns a different result.
tests : [{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += length(a1,a2,a3,a4);
c += length(a2,a3,a4,a1);
c += length(a3,a4,a1,a2);
c += length(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "length64",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += lengthF(a1,a2,a3,a4);
c += lengthF(a2,a3,a4,a1);
c += lengthF(a3,a4,a1,a2);
c += lengthF(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "length32",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += len(a1,a2,a3,a4);
c += len(a2,a3,a4,a1);
c += len(a3,a4,a1,a2);
c += len(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "length JS",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += lenSlow(a1,a2,a3,a4);
c += lenSlow(a2,a3,a4,a1);
c += lenSlow(a3,a4,a1,a2);
c += lenSlow(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "Length JS Slow",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += lenEmpty(a1,a2,a3,a4);
c += lenEmpty(a2,a3,a4,a1);
c += lenEmpty(a3,a4,a1,a2);
c += lenEmpty(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "Empty",
}
],
Because there is a lot more overhead in the test, the results are closer, but the JS code is still two orders of magnitude faster.
Note how slow the function Math.hypot is. If optimisation were in effect, that function would be near the faster len function.
/*
=======================================
Performance test. : WebAssm V Javascript
Use strict....... : true
Data view........ : false
Duplicates....... : 4
Cycles........... : 147
Samples per cycle : 100
Tests per Sample. : undefined
---------------------------------------------
Test : 'length64'
Mean : 12736µs ±69µs (*) 3013 samples
---------------------------------------------
Test : 'length32'
Mean : 13389µs ±94µs (*) 2914 samples
---------------------------------------------
Test : 'length JS'
Mean : 728µs ±6µs (*) 2906 samples
---------------------------------------------
Test : 'Length JS Slow'
Mean : 23374µs ±191µs (*) 2939 samples << This function uses Math.hypot rather than Math.sqrt
---------------------------------------------
Test : 'Empty'
Mean : 79µs ±2µs (*) 2928 samples
-All ----------------------------------------
Mean : 10.097ms Totals time : 148431.200ms 14700 samples
(*) Error rate approximation does not represent the variance.
*/
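(An aside on the Math.hypot numbers above: part of why it is slower than the naive Math.sqrt form is that it also guards against intermediate overflow and underflow, which the naive formulation does not. A small sketch:)

```javascript
// Naive length of (x, y): the intermediate x*x can overflow to Infinity.
const naiveLen = (x, y) => Math.sqrt(x * x + y * y);

console.log(naiveLen(3e200, 4e200));   // Infinity: 9e400 exceeds the double range
console.log(Math.hypot(3e200, 4e200)); // ~5e200: hypot rescales internally
```

That extra robustness is work the engine cannot skip, so Math.hypot will rarely match a plain sqrt of squares on well-scaled inputs.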
What's the point of WebAssembly if it does not optimise?
End of update
Find length of a line.
Original source in custom language
// declare func the < indicates export name, the param with types and return type
func <lengthF(float x, float y, float x1, float y1) float {
float nx, ny, dist; // declare locals float is f32
nx = x1 - x;
ny = y1 - y;
dist = sqrt(ny * ny + nx * nx);
return dist;
}
// and as double
func <length(double x, double y, double x1, double y1) double {
double nx, ny, dist;
nx = x1 - x;
ny = y1 - y;
dist = sqrt(ny * ny + nx * nx);
return dist;
}
The code compiles to WAT for proofreading:
(module
(func
(export "lengthF")
(param f32 f32 f32 f32)
(result f32)
(local f32 f32 f32)
get_local 2
get_local 0
f32.sub
set_local 4
get_local 3
get_local 1
f32.sub
tee_local 5
get_local 5
f32.mul
get_local 4
get_local 4
f32.mul
f32.add
f32.sqrt
)
(func
(export "length")
(param f64 f64 f64 f64)
(result f64)
(local f64 f64 f64)
get_local 2
get_local 0
f64.sub
set_local 4
get_local 3
get_local 1
f64.sub
tee_local 5
get_local 5
f64.mul
get_local 4
get_local 4
f64.mul
f64.add
f64.sqrt
)
)
The compiled wasm as a hex string (note: it does not include the name section), loaded using WebAssembly.compile. The exported functions are then run against the JavaScript function len (in the snippet below).
// hex of above without the name section
const asm = `0061736d0100000001110260047d7d7d7d017d60047c7c7c7c017c0303020001071402076c656e677468460000066c656e67746800010a3b021c01037d2002200093210420032001932205200594200420049492910b1c01037c20022000a1210420032001a122052005a220042004a2a09f0b`
const bin = new Uint8Array(asm.length >> 1);
for(var i = 0; i < asm.length; i+= 2){ bin[i>>1] = parseInt(asm.substr(i,2),16) }
var length,lengthF;
WebAssembly.compile(bin).then(module => {
const wasmInstance = new WebAssembly.Instance(module, {});
lengthF = wasmInstance.exports.lengthF;
length = wasmInstance.exports.length;
});
// test values are const (same result if from array or literals)
const a1 = rand(-100000,100000);
const a2 = rand(-100000,100000);
const a3 = rand(-100000,100000);
const a4 = rand(-100000,100000);
// javascript version of function
function len(x,y,x1,y1){
var nx = x1 - x;
var ny = y1 - y;
return Math.sqrt(nx * nx + ny * ny);
}
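(A side note sketch: the asynchronous WebAssembly.compile path above is the recommended one; for reference, the synchronous API has the same shape. Shown here with the smallest possible module — just the magic number and version — which exports nothing:)

```javascript
// Smallest valid wasm module: magic "\0asm" plus version 1, no sections.
const empty = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

const mod = new WebAssembly.Module(empty);   // synchronous compile
const inst = new WebAssembly.Instance(mod);  // no imports required

console.log(Object.keys(inst.exports)); // [] — nothing exported
```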
And the test code is the same for all 3 functions and is run in strict mode.
tests : [{
func : function (){
var i;
for (i = 0; i < 100000; i += 1) {
length(a1,a2,a3,a4);
}
},
name : "length64",
},{
func : function (){
var i;
for (i = 0; i < 100000; i += 1) {
lengthF(a1,a2,a3,a4);
}
},
name : "length32",
},{
func : function (){
var i;
for (i = 0; i < 100000; i += 1) {
len(a1,a2,a3,a4);
}
},
name : "lengthNative",
}
]
The test results on Firefox are:
/*
=======================================
Performance test. : WebAssm V Javascript
Use strict....... : true
Data view........ : false
Duplicates....... : 4
Cycles........... : 34
Samples per cycle : 100
Tests per Sample. : undefined
---------------------------------------------
Test : 'length64'
Mean : 26359µs ±128µs (*) 1128 samples
---------------------------------------------
Test : 'length32'
Mean : 27456µs ±109µs (*) 1144 samples
---------------------------------------------
Test : 'lengthNative'
Mean : 106µs ±2µs (*) 1128 samples
-All ----------------------------------------
Mean : 18.018ms Totals time : 61262.240ms 3400 samples
(*) Error rate approximation does not represent the variance.
*/
Upvotes: 36
Views: 22153
Reputation: 8484
It seemed like option 2 —
- WebAssembly is far from a ready technology.

— actually did play a role in this: performance of calling WASM from JS in Firefox was improved in late 2018. Running your benchmarks in a current Firefox/Chromium yields results like "calling the WASM implementation from JS is 4-10 times slower than calling the JS implementation from JS". Still, it seems engines don't inline across the WASM/JS border, and the overhead of having to call vs. not having to call is significant (as the other answers already pointed out).
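(The boundary cost is easy to poke at yourself. A sketch, not a rigorous harness: a minimal hand-assembled wasm module exporting an f64 add, timed against the identical JS function. The absolute numbers will vary wildly by engine; the point is only that both are callable the same way from JS.)

```javascript
// Hand-assembled module: (func (export "add") (param f64 f64) (result f64) ...)
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,                   // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7c, 0x7c, 0x01, 0x7c,             // type: (f64,f64)->f64
  0x03, 0x02, 0x01, 0x00,                                           // func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,             // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0xa0, 0x0b  // local.get 0/1, f64.add
]);
const wasmAdd = new WebAssembly.Instance(new WebAssembly.Module(bytes)).exports.add;
const jsAdd = (a, b) => a + b;

// Time a million calls, consuming every result so the loop cannot be eliminated.
function time(fn) {
  const t0 = Date.now();
  let c = 0;
  for (let i = 0; i < 1e6; i++) c += fn(i, i);
  return { ms: Date.now() - t0, c };
}
console.log("wasm:", time(wasmAdd).ms, "ms  js:", time(jsAdd).ms, "ms");
```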
Your benchmarks are all wrong. It turns out that JS is actually 8-40 times (FF, Chrome) slower than WASM. WTF, JS is soo slooow.
Do I intend to prove that? Of course (not).
First, I re-implement your benchmarking code in C:
#include <math.h>
#include <stdint.h> /* for int64_t used in main */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static double lengthC(double x, double y, double x1, double y1) {
double nx = x1 - x;
double ny = y1 - y;
return sqrt(nx * nx + ny * ny);
}
double lengthArrayC(double* a, size_t length) {
double c = 0;
for (size_t i = 0; i < length; i++) {
double a1 = a[i + 0];
double a2 = a[i + 1];
double a3 = a[i + 2];
double a4 = a[i + 3];
c += lengthC(a1,a2,a3,a4);
c += lengthC(a2,a3,a4,a1);
c += lengthC(a3,a4,a1,a2);
c += lengthC(a4,a1,a2,a3);
}
return c;
}
#ifdef __wasm__
__attribute__((import_module("js"), import_name("len")))
double lengthJS(double x, double y, double x1, double y1);
double lengthArrayJS(double* a, size_t length) {
double c = 0;
for (size_t i = 0; i < length; i++) {
double a1 = a[i + 0];
double a2 = a[i + 1];
double a3 = a[i + 2];
double a4 = a[i + 3];
c += lengthJS(a1,a2,a3,a4);
c += lengthJS(a2,a3,a4,a1);
c += lengthJS(a3,a4,a1,a2);
c += lengthJS(a4,a1,a2,a3);
}
return c;
}
__attribute__((import_module("bench"), import_name("now")))
double now();
__attribute__((import_module("bench"), import_name("result")))
void printtime(int benchidx, double ns);
#else
void printtime(int benchidx, double ns) {
if (benchidx == 1) {
printf("C: %f ns\n", ns);
} else if (benchidx == 0) {
printf("avoid the optimizer: %f\n", ns);
} else {
fprintf(stderr, "Unknown benchmark: %d", benchidx);
exit(-1);
}
}
double now() {
struct timespec ts;
if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
} else {
return sqrt(-1);
}
}
#endif
#define iters 1000000
double a[iters+3];
int main() {
int bigCount = 0;
srand((unsigned int)now());
for (size_t i = 0; i < iters + 3; i++)
a[i] = (double)rand()/RAND_MAX*2e5-1e5;
for (int i = 0; i < 10; i++) {
double startTime, endTime;
double c;
startTime = now();
c = lengthArrayC(a, iters);
endTime = now();
bigCount = (bigCount + (int64_t)c) % 1000;
printtime(1, (endTime - startTime) * 1e9 / iters / 4);
#ifdef __wasm__
startTime = now();
c = lengthArrayJS(a, iters);
endTime = now();
bigCount = (bigCount + (int64_t)c) % 1000;
printtime(2, (endTime - startTime) * 1e9 / iters / 4);
#endif
}
printtime(0, bigCount);
return 0;
}
Compile it with clang 12.0.1:
clang -O3 -target wasm32-wasi --sysroot /opt/wasi-sdk/wasi-sysroot/ foo2.c -o foo2.wasm
And provide it with a length function from JS via imports:
"use strict";
(async (wasm) => {
const wasmbytes = new Uint8Array(wasm.length);
for (var i in wasm)
wasmbytes[i] = wasm.charCodeAt(i);
(await WebAssembly.instantiate(wasmbytes, {
js: {
len: function (x,y,x1,y1) {
var nx = x1 - x;
var ny = y1 - y;
return Math.sqrt(nx * nx + ny * ny);
}
},
bench: {
now: () => window.performance.now() / 1e3,
result: (bench, ns) => {
let name;
if (bench == 1) { name = "C" }
else if (bench == 2) { name = "JS" }
else if (bench == 0) { console.log("Optimizer confuser: " + ns); /*not really necessary*/; return; }
else { throw "unknown bench"; }
console.log(name + ": " + ns + " ns");
},
},
})).instance.exports._start();
})(atob('AGFzbQEAAAABFQRgBHx8fHwBfGAAAXxgAn98AGAAAAIlAwJqcwNsZW4AAAViZW5jaANub3cAAQViZW5jaAZyZXN1bHQAAgMCAQMFAwEAfAcTAgZtZW1vcnkCAAZfc3RhcnQAAwr2BAHzBAMIfAJ/An5BmKzoAwJ/EAEiA0QAAAAAAADwQWMgA0QAAAAAAAAAAGZxBEAgA6sMAQtBAAtBAWutNwMAQejbl3whCANAQZis6ANBmKzoAykDAEKt/tXk1IX9qNgAfkIBfCIKNwMAIAhBmKzoA2ogCkIhiKe3RAAAwP///99Bo0QAAAAAAGoIQaJEAAAAAABq+MCgOQMAIAhBCGoiCA0ACwNAEAEhBkGQCCsDACEBQYgIKwMAIQRBgAgrAwAhAEQAAAAAAAAAACECQRghCANAIAQhAyABIgQgAKEiASABoiIHIAMgCEGACGorAwAiAaEiBSAFoiIFoJ8gACAEoSIAIACiIgAgBaCfIAAgASADoSIAIACiIgCgnyACIAcgAKCfoKCgoCECIAMhACAIQQhqIghBmKToA0cNAAtBARABIAahRAAAAABlzc1BokQAAAAAgIQuQaNEAAAAAAAA0D+iEAICfiACmUQAAAAAAADgQ2MEQCACsAwBC0KAgICAgICAgIB/CyALfEQAAAAAAAAAACECQYDcl3whCBABIQMDQCACIAhBgKzoA2orAwAiBSAIQYis6ANqKwMAIgEgCEGQrOgDaisDACIAIAhBmKzoA2orAwAiBBAAoCABIAAgBCAFEACgIAAgBCAFIAEQAKAgBCAFIAEgABAAoCECIAhBCGoiCA0AC0ECEAEgA6FEAAAAAGXNzUGiRAAAAACAhC5Bo0QAAAAAAADQP6IQAkLoB4EhCgJ+IAKZRAAAAAAAAOBDYwRAIAKwDAELQoCAgICAgICAgH8LIAp8QugHgSELIAlBAWoiCUEKRw0AC0EAIAuntxACCwB2CXByb2R1Y2VycwEMcHJvY2Vzc2VkLWJ5AQVjbGFuZ1YxMS4wLjAgKGh0dHBzOi8vZ2l0aHViLmNvbS9sbHZtL2xsdm0tcHJvamVjdCAxNzYyNDliZDY3MzJhODA0NGQ0NTcwOTJlZDkzMjc2ODcyNGE2ZjA2KQ=='))
Now, calling the JS function from WASM is unsurprisingly a lot slower than calling the WASM function from WASM. (In fact, WASM→WASM there is no call at all: you can see the f64.sqrt being inlined into _start.)
(One last interesting datapoint is that WASM→WASM and JS→JS seem to have about the same cost — about 1.5 ns per inlined length(…) on my E3-1280. Disclaimer: It's entirely possible that my benchmark is even more broken than the original question.)
WASM isn't slow, crossing the border is. For now and the foreseeable future, don't put things into WASM unless they're a significant computational task. (And even then, it depends. Sometimes, JS engines are really smart. Sometimes.)
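(One common mitigation of the border cost is batching: cross the JS↔WASM boundary once per array rather than once per segment, as the C lengthArrayC above does. Sketched here in plain JS with a hypothetical lengthArray export — the real thing would read a Float64Array placed in the module's linear memory:)

```javascript
// Hypothetical batched API: coords is a flat array [x, y, x1, y1, x, y, x1, y1, ...]
// and one call returns the summed segment lengths, instead of N boundary crossings.
function lengthArray(coords) {
  let total = 0;
  for (let i = 0; i + 3 < coords.length; i += 4) {
    const nx = coords[i + 2] - coords[i];
    const ny = coords[i + 3] - coords[i + 1];
    total += Math.sqrt(nx * nx + ny * ny);
  }
  return total;
}

// One call for two segments (3-4-5 and 6-8-10 triangles):
console.log(lengthArray(new Float64Array([0, 0, 3, 4, 0, 0, 6, 8]))); // 15
```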
Upvotes: 1
Reputation: 70122
Andreas describes a number of good reasons why the JavaScript implementation was initially observed to be 300 times faster. However, there are a number of other issues with your code.
For a more definitive answer, see the joint paper from the WebAssembly team, which outlines an expected runtime performance gain of around 30%.
Finally, to answer your point:
What's the point of WebAssembly if it does not optimise?
I think you have misconceptions around what WebAssembly will do for you. Based on the paper above, the runtime performance optimisations are quite modest. However, there are still a number of performance advantages: the compact binary format downloads and decodes faster than the equivalent JavaScript parses, and performance is more predictable, with no warm-up or deoptimisation cliffs.
There are also a number of non-performance related advantages too.
For a more realistic performance measurement, take a look at:
Both are practical, production codebases.
Upvotes: 27
Reputation: 36088
The JS engine can apply a lot of dynamic optimisations to this example:
1. Perform all calculations with integers and only convert to double for the final call to Math.sqrt.
2. Inline the call to the len function.
3. Hoist the computation out of the loop, since it always computes the same thing.
4. Recognise that the loop is left empty and eliminate it entirely.
5. Recognise that the result is never returned from the testing function, and hence remove the entire body of the test function.
All but (4) apply even if you add the result of every call. With (5) the end result is an empty function either way.
With Wasm an engine cannot do most of these steps, because it cannot inline across language boundaries (at least no engine does that today, AFAICT). Also, for Wasm it is assumed that the producing (offline) compiler has already performed relevant optimisations, so a Wasm JIT tends to be less aggressive than one for JavaScript, where static optimisation is impossible.
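(Of the optimisations listed above, (2) is out of a benchmark author's hands, but a harness can at least resist (1) and (3)-(5) by feeding random doubles, varying the inputs every iteration, and making every result observable. A sketch:)

```javascript
function len(x, y, x1, y1) {
  const nx = x1 - x, ny = y1 - y;
  return Math.sqrt(nx * nx + ny * ny);
}

// Random doubles block (1) integer specialisation; indexing by i blocks (3) hoisting.
const data = Array.from({ length: 1003 }, () => Math.random() * 2e5 - 1e5);
let sink = 0; // global side effect: the result is observable, blocking (4) and (5)

function bench(n) {
  let c = 0;
  for (let i = 0; i < n; i++) {
    c += len(data[i], data[i + 1], data[i + 2], data[i + 3]); // consume every result
  }
  sink = (sink + c) % 1000;
  return c;
}
bench(1000);
```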
Upvotes: 7