Blindman67

Reputation: 54026

Why is a WebAssembly function almost 300 times slower than the same JS function

Finding the length of a line is 300 times slower

First off, I have read the answer to Why is my WebAssembly function slower than the JavaScript equivalent?

But it has shed little light on the problem, and I have invested a lot of time that may well be that yellow stuff against the wall.

I do not use globals, I do not use any memory. I have two simple functions that find the length of a line segment, and I compare them to the same thing in plain old JavaScript. Each has 4 params and 3 more locals, and returns a float or double.

On Chrome the JavaScript is 40 times faster than the WebAssembly, and on Firefox the wasm is almost 300 times slower than the JavaScript.

jsPerf test case.

I have added a test case to jsPerf: WebAssembly V Javascript math

What am I doing wrong?

Either

  1. I have missed an obvious bug, bad practice, or I am suffering coder stupidity.
  2. WebAssembly is not for 32bit OS (win 10 laptop i7CPU)
  3. WebAssembly is far from a ready technology.

Please please be option 1.

I have read the WebAssembly use case:

Re-use existing code by targeting WebAssembly, embedded in a larger JavaScript / HTML application. This could be anything from simple helper libraries, to compute-oriented task offload.

I was hoping I could replace some geometry libs with webAssembly to get some extra performance. I was hoping that it would be awesome, like 10 or more times faster. BUT 300 times slower WTF.


UPDATE

This is not a JS optimisation issue.

To ensure that optimisation has as little effect as possible, I have tested using the following methods to reduce or eliminate any optimisation bias.

// setup and associated functions
    const setOf = (count, callback) => {var a = [],i = 0; while (i < count) { a.push(callback(i ++)) } return a };
    const rand  = (min = 1, max = min + (min = 0)) => Math.random() * (max - min) + min;
    const a = setOf(100009,i=>rand(-100000,100000));
    var bigCount = 0;




    function len(x,y,x1,y1){
        var nx = x1 - x;
        var ny = y1 - y;
        return Math.sqrt(nx * nx + ny * ny);
    }
    function lenSlow(x,y,x1,y1){
        var nx = x1 - x;
        var ny = y1 - y;
        return Math.hypot(nx,ny);
    }
    function lenEmpty(x,y,x1,y1){
        return x;
    }


// Test functions in same scope as above. None is in global scope
// Each function is copied 4 times and tests are performed randomly.
// c += length(...  to ensure all code is executed. 
// bigCount += c to ensure whole function is executed.
// 4 lines for each function to reduce inlining skew
// all values are randomly generated doubles 
// each function call returns a different result.

tests : [{
        func : function (){
            var i,c=0,a1,a2,a3,a4;
            for (i = 0; i < 10000; i += 1) {
                a1 = a[i];
                a2 = a[i+1];
                a3 = a[i+2];
                a4 = a[i+3];
                c += length(a1,a2,a3,a4);
                c += length(a2,a3,a4,a1);
                c += length(a3,a4,a1,a2);
                c += length(a4,a1,a2,a3);
            }
            bigCount = (bigCount + c) % 1000;
        },
        name : "length64",
    },{
        func : function (){
            var i,c=0,a1,a2,a3,a4;
            for (i = 0; i < 10000; i += 1) {
                a1 = a[i];
                a2 = a[i+1];
                a3 = a[i+2];
                a4 = a[i+3];
                c += lengthF(a1,a2,a3,a4);
                c += lengthF(a2,a3,a4,a1);
                c += lengthF(a3,a4,a1,a2);
                c += lengthF(a4,a1,a2,a3);
            }
            bigCount = (bigCount + c) % 1000;
        },
        name : "length32",
    },{
        func : function (){
            var i,c=0,a1,a2,a3,a4;
            for (i = 0; i < 10000; i += 1) {
                a1 = a[i];
                a2 = a[i+1];
                a3 = a[i+2];
                a4 = a[i+3];                    
                c += len(a1,a2,a3,a4);
                c += len(a2,a3,a4,a1);
                c += len(a3,a4,a1,a2);
                c += len(a4,a1,a2,a3);
            }
            bigCount = (bigCount + c) % 1000;
        },
        name : "length JS",
    },{
        func : function (){
            var i,c=0,a1,a2,a3,a4;
            for (i = 0; i < 10000; i += 1) {
                a1 = a[i];
                a2 = a[i+1];
                a3 = a[i+2];
                a4 = a[i+3];                    
                c += lenSlow(a1,a2,a3,a4);
                c += lenSlow(a2,a3,a4,a1);
                c += lenSlow(a3,a4,a1,a2);
                c += lenSlow(a4,a1,a2,a3);
            }
            bigCount = (bigCount + c) % 1000;
        },
        name : "Length JS Slow",
    },{
        func : function (){
            var i,c=0,a1,a2,a3,a4;
            for (i = 0; i < 10000; i += 1) {
                a1 = a[i];
                a2 = a[i+1];
                a3 = a[i+2];
                a4 = a[i+3];                    
                c += lenEmpty(a1,a2,a3,a4);
                c += lenEmpty(a2,a3,a4,a1);
                c += lenEmpty(a3,a4,a1,a2);
                c += lenEmpty(a4,a1,a2,a3);
            }
            bigCount = (bigCount + c) % 1000;
        },
        name : "Empty",
    }
],

Results from update.

Because there is a lot more overhead in the test the results are closer but the JS code is still two orders of magnitude faster.

Note how slow the function Math.hypot is. If optimisation was in effect that function would be near the faster len function.

/*
=======================================
Performance test. : WebAssm V Javascript
Use strict....... : true
Data view........ : false
Duplicates....... : 4
Cycles........... : 147
Samples per cycle : 100
Tests per Sample. : undefined
---------------------------------------------
Test : 'length64'
Mean : 12736µs ±69µs (*) 3013 samples
---------------------------------------------
Test : 'length32'
Mean : 13389µs ±94µs (*) 2914 samples
---------------------------------------------
Test : 'length JS'
Mean : 728µs ±6µs (*) 2906 samples
---------------------------------------------
Test : 'Length JS Slow'
Mean : 23374µs ±191µs (*) 2939 samples   << This function use Math.hypot 
                                            rather than Math.sqrt
---------------------------------------------
Test : 'Empty'
Mean : 79µs ±2µs (*) 2928 samples
-All ----------------------------------------
Mean : 10.097ms Totals time : 148431.200ms 14700 samples
(*) Error rate approximation does not represent the variance.

*/

What's the point of WebAssembly if it does not optimise

End of update


All the stuff related to the problem.

Find length of a line.

Original source in custom language

   
// declare func the < indicates export name, the param with types and return type
func <lengthF(float x, float y, float x1, float y1) float {
    float nx, ny, dist;  // declare locals float is f32
    nx = x1 - x;
    ny = y1 - y;
    dist = sqrt(ny * ny + nx * nx);
    return dist;
}
// and as double
func <length(double x, double y, double x1, double y1) double {
    double nx, ny, dist;
    nx = x1 - x;
    ny = y1 - y;
    dist = sqrt(ny * ny + nx * nx);
    return dist;
}

The code compiles to WAT for proofreading

(module
(func 
    (export "lengthF")
    (param f32 f32 f32 f32)
    (result f32)
    (local f32 f32 f32)
    get_local 2
    get_local 0
    f32.sub
    set_local 4
    get_local 3
    get_local 1
    f32.sub
    tee_local 5
    get_local 5
    f32.mul
    get_local 4
    get_local 4
    f32.mul
    f32.add
    f32.sqrt
)
(func 
    (export "length")
    (param f64 f64 f64 f64)
    (result f64)
    (local f64 f64 f64)
    get_local 2
    get_local 0
    f64.sub
    set_local 4
    get_local 3
    get_local 1
    f64.sub
    tee_local 5
    get_local 5
    f64.mul
    get_local 4
    get_local 4
    f64.mul
    f64.add
    f64.sqrt
)
)

The compiled wasm as a hex string (note that it does not include the name section), loaded using WebAssembly.compile. The exported functions are then run against the JavaScript function len (in the snippet below).

    // hex of above without the name section
    const asm = `0061736d0100000001110260047d7d7d7d017d60047c7c7c7c017c0303020001071402076c656e677468460000066c656e67746800010a3b021c01037d2002200093210420032001932205200594200420049492910b1c01037c20022000a1210420032001a122052005a220042004a2a09f0b`
    const bin = new Uint8Array(asm.length >> 1);
    for(var i = 0; i < asm.length; i+= 2){ bin[i>>1] = parseInt(asm.substr(i,2),16) }
    var length,lengthF;

    WebAssembly.compile(bin).then(module => {
        const wasmInstance = new WebAssembly.Instance(module, {});
        lengthF = wasmInstance.exports.lengthF;
        length = wasmInstance.exports.length;
    });
    // test values are const (same result if from array or literals)
    const a1 = rand(-100000,100000);
    const a2 = rand(-100000,100000);
    const a3 = rand(-100000,100000);
    const a4 = rand(-100000,100000);

    // javascript version of function
    function len(x,y,x1,y1){
        var nx = x1 - x;
        var ny = y1 - y;
        return Math.sqrt(nx * nx + ny * ny);
    }

And the test code is the same for all 3 functions and run in strict mode.

 tests : [{
        func : function (){
            var i;
            for (i = 0; i < 100000; i += 1) {
               length(a1,a2,a3,a4);

            }
        },
        name : "length64",
    },{
        func : function (){
            var i;
            for (i = 0; i < 100000; i += 1) {
                lengthF(a1,a2,a3,a4);
             
            }
        },
        name : "length32",
    },{
        func : function (){
            var i;
            for (i = 0; i < 100000; i += 1) {
                len(a1,a2,a3,a4);
             
            }
        },
        name : "lengthNative",
    }
]

The test results on Firefox are

 /*
=======================================
Performance test. : WebAssm V Javascript
Use strict....... : true
Data view........ : false
Duplicates....... : 4
Cycles........... : 34
Samples per cycle : 100
Tests per Sample. : undefined
---------------------------------------------
Test : 'length64'
Mean : 26359µs ±128µs (*) 1128 samples
---------------------------------------------
Test : 'length32'
Mean : 27456µs ±109µs (*) 1144 samples
---------------------------------------------
Test : 'lengthNative'
Mean : 106µs ±2µs (*) 1128 samples
-All ----------------------------------------
Mean : 18.018ms Totals time : 61262.240ms 3400 samples
(*) Error rate approximation does not represent the variance.
*/

Upvotes: 36

Views: 22153

Answers (3)

Caesar

Reputation: 8484

Serious answer

It seemed like

  3. WebAssembly is far from a ready technology.

actually did play a role in this, and performance of calling WASM from JS in Firefox was improved in late 2018. Running your benchmarks in a current FF/Chromium yields results like "Calling the WASM implementation from JS is 4-10 times slower than calling the JS implementation from JS". Still, it seems like engines don't inline across WASM/JS borders, and the overhead of having to call vs. not having to call is significant (as the other answers already pointed out).

Mocking answer

Your benchmarks are all wrong. It turns out that JS is actually 8-40 times (FF, Chrome) slower than WASM. WTF, JS is soo slooow.

Do I intend to prove that? Of course (not).

First, I re-implement your benchmarking code in C:

#include <math.h>
#include <stdint.h>  /* int64_t, used in main() */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double lengthC(double x, double y, double x1, double y1) {
    double nx = x1 - x;
    double ny = y1 - y;
    return sqrt(nx * nx + ny * ny);
}
double lengthArrayC(double* a, size_t length) {
    double c = 0;
    for (size_t i = 0; i < length; i++) {
        double a1 = a[i + 0];
        double a2 = a[i + 1];
        double a3 = a[i + 2];
        double a4 = a[i + 3];
        c += lengthC(a1,a2,a3,a4);
        c += lengthC(a2,a3,a4,a1);
        c += lengthC(a3,a4,a1,a2);
        c += lengthC(a4,a1,a2,a3);
    }
    return c;
}

#ifdef __wasm__
__attribute__((import_module("js"), import_name("len")))
double lengthJS(double x, double y, double x1, double y1);
double lengthArrayJS(double* a, size_t length) {
    double c = 0;
    for (size_t i = 0; i < length; i++) {
        double a1 = a[i + 0];
        double a2 = a[i + 1];
        double a3 = a[i + 2];
        double a4 = a[i + 3];
        c += lengthJS(a1,a2,a3,a4);
        c += lengthJS(a2,a3,a4,a1);
        c += lengthJS(a3,a4,a1,a2);
        c += lengthJS(a4,a1,a2,a3);
    }
    return c;
}

__attribute__((import_module("bench"), import_name("now")))
double now();

__attribute__((import_module("bench"), import_name("result")))
void printtime(int benchidx, double ns);
#else
void printtime(int benchidx, double ns) {
    if (benchidx == 1) {
        printf("C: %f ns\n", ns);
    } else if (benchidx == 0) {
        printf("avoid the optimizer: %f\n", ns);
    } else { 
        fprintf(stderr, "Unknown benchmark: %d", benchidx);
        exit(-1);
    }
}
double now() {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
        return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
    } else {
        return sqrt(-1);
    }
}
#endif

#define iters 1000000
double a[iters+3];

int main() {
    int bigCount = 0;
    srand(now());
    for (size_t i = 0; i < iters + 3; i++)
         a[i] = (double)rand()/RAND_MAX*2e5-1e5;
    
    for (int i = 0; i < 10; i++) {
        double startTime, endTime;
        double c;
        startTime = now();
        c = lengthArrayC(a, iters);
        endTime = now();
        bigCount = (bigCount + (int64_t)c) % 1000;
        printtime(1, (endTime - startTime) * 1e9 / iters / 4);
#ifdef __wasm__
        startTime = now();
        c = lengthArrayJS(a, iters);
        endTime = now();
        bigCount = (bigCount + (int64_t)c) % 1000;
        printtime(2, (endTime - startTime) * 1e9 / iters / 4);
#endif
    }
    printtime(0, bigCount);
    return 0;
}

Compile it with clang 12.0.1:

clang -O3 -target wasm32-wasi --sysroot /opt/wasi-sdk/wasi-sysroot/ foo2.c -o foo2.wasm

And provide it with a length function from JS via imports:

"use strict";
(async (wasm) => {
    const wasmbytes = new Uint8Array(wasm.length);
    for (var i in wasm)
        wasmbytes[i] = wasm.charCodeAt(i);
    (await WebAssembly.instantiate(wasmbytes, {
        js: {
            len: function (x,y,x1,y1) {
                var nx = x1 - x;
                var ny = y1 - y;
                return Math.sqrt(nx * nx + ny * ny);
            }
        },
        bench: {
            now: () => window.performance.now() / 1e3,
            result: (bench, ns) => {
                let name;
                if (bench == 1) { name = "C" }
                else if (bench == 2) { name = "JS" }
                else if (bench == 0) { console.log("Optimizer confuser: " + ns); /*not really necessary*/; return; }
                else { throw "unknown bench"; }
                console.log(name + ": " + ns + " ns");
            },
        },
    })).instance.exports._start();
})(atob('AGFzbQEAAAABFQRgBHx8fHwBfGAAAXxgAn98AGAAAAIlAwJqcwNsZW4AAAViZW5jaANub3cAAQViZW5jaAZyZXN1bHQAAgMCAQMFAwEAfAcTAgZtZW1vcnkCAAZfc3RhcnQAAwr2BAHzBAMIfAJ/An5BmKzoAwJ/EAEiA0QAAAAAAADwQWMgA0QAAAAAAAAAAGZxBEAgA6sMAQtBAAtBAWutNwMAQejbl3whCANAQZis6ANBmKzoAykDAEKt/tXk1IX9qNgAfkIBfCIKNwMAIAhBmKzoA2ogCkIhiKe3RAAAwP///99Bo0QAAAAAAGoIQaJEAAAAAABq+MCgOQMAIAhBCGoiCA0ACwNAEAEhBkGQCCsDACEBQYgIKwMAIQRBgAgrAwAhAEQAAAAAAAAAACECQRghCANAIAQhAyABIgQgAKEiASABoiIHIAMgCEGACGorAwAiAaEiBSAFoiIFoJ8gACAEoSIAIACiIgAgBaCfIAAgASADoSIAIACiIgCgnyACIAcgAKCfoKCgoCECIAMhACAIQQhqIghBmKToA0cNAAtBARABIAahRAAAAABlzc1BokQAAAAAgIQuQaNEAAAAAAAA0D+iEAICfiACmUQAAAAAAADgQ2MEQCACsAwBC0KAgICAgICAgIB/CyALfEQAAAAAAAAAACECQYDcl3whCBABIQMDQCACIAhBgKzoA2orAwAiBSAIQYis6ANqKwMAIgEgCEGQrOgDaisDACIAIAhBmKzoA2orAwAiBBAAoCABIAAgBCAFEACgIAAgBCAFIAEQAKAgBCAFIAEgABAAoCECIAhBCGoiCA0AC0ECEAEgA6FEAAAAAGXNzUGiRAAAAACAhC5Bo0QAAAAAAADQP6IQAkLoB4EhCgJ+IAKZRAAAAAAAAOBDYwRAIAKwDAELQoCAgICAgICAgH8LIAp8QugHgSELIAlBAWoiCUEKRw0AC0EAIAuntxACCwB2CXByb2R1Y2VycwEMcHJvY2Vzc2VkLWJ5AQVjbGFuZ1YxMS4wLjAgKGh0dHBzOi8vZ2l0aHViLmNvbS9sbHZtL2xsdm0tcHJvamVjdCAxNzYyNDliZDY3MzJhODA0NGQ0NTcwOTJlZDkzMjc2ODcyNGE2ZjA2KQ=='))

Now, calling the JS function from WASM is unsurprisingly a lot slower than calling the WASM function from WASM. (In fact, in the WASM→WASM case there is no call at all: you can see the f64.sqrt being inlined into _start.)

(One last interesting datapoint is that WASM→WASM and JS→JS seem to have about the same cost (about 1.5 ns per inlined length(…) on my E3-1280). Disclaimer: It's entirely possible that my benchmark is even more broken than the original question.)

Conclusion

WASM isn't slow, crossing the border is. For now and the foreseeable future, don't put things into WASM unless they're a significant computational task. (And even then, it depends. Sometimes, JS engines are really smart. Sometimes.)
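To look at the border cost in isolation, one could time the question's own exported `length` against the plain JS `len` over the same varying inputs. This is only a sketch: `nsPerCall`, `clock` and `wasmLength` are hypothetical names, the hex string is the module posted in the question, and the absolute numbers will depend heavily on the engine and its version.

```javascript
// Hex bytes of the module from the question (exports `lengthF` and `length`).
const hex = "0061736d0100000001110260047d7d7d7d017d60047c7c7c7c017c0303020001071402076c656e677468460000066c656e67746800010a3b021c01037d2002200093210420032001932205200594200420049492910b1c01037c20022000a1210420032001a122052005a220042004a2a09f0b";
const bytes = Uint8Array.from(hex.match(/../g), (h) => parseInt(h, 16));

// Synchronous compile and instantiate; acceptable here because the module is tiny.
const wasmExports = new WebAssembly.Instance(new WebAssembly.Module(bytes), {}).exports;
const wasmLength = wasmExports.length;

// Plain JS version, as in the question.
function len(x, y, x1, y1) {
    const nx = x1 - x;
    const ny = y1 - y;
    return Math.sqrt(nx * nx + ny * ny);
}

// Assumption: fall back to Date if `performance` is not available.
const clock = typeof performance !== "undefined" ? performance : Date;

// Times `fn` over sliding windows of `data`. The sum is returned so the
// engine cannot discard the calls as dead code.
function nsPerCall(fn, data) {
    let sum = 0;
    const start = clock.now();
    for (let i = 0; i + 3 < data.length; i++) {
        sum += fn(data[i], data[i + 1], data[i + 2], data[i + 3]);
    }
    const elapsedMs = clock.now() - start;
    return { ns: (elapsedMs * 1e6) / (data.length - 3), sum };
}

const data = Array.from({ length: 100003 }, () => Math.random() * 200000 - 100000);
const jsRun = nsPerCall(len, data);
const wasmRun = nsPerCall(wasmLength, data);
console.log("JS   -> JS  : " + jsRun.ns.toFixed(2) + " ns per call");
console.log("JS   -> WASM: " + wasmRun.ns.toFixed(2) + " ns per call");
```

A single pass like this includes warm-up time, so repeating the two measurements a few times and comparing the later runs gives a fairer picture.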

Upvotes: 1

ColinE

Reputation: 70122

Andreas describes a number of good reasons why the JavaScript implementation was initially observed to be x300 faster. However, there are a number of other issues with your code.

  1. This is a classic 'micro benchmark', i.e. the code that you are testing is so small that the other overheads within your test loop are a significant factor. For example, there is an overhead in calling WebAssembly from JavaScript, which will factor into your results. What are you trying to measure: raw processing speed, or the overhead of the language boundary?
  2. Your results vary wildly, from x300 to x2, due to small changes in your test code. Again, this is a micro benchmark issue. Others have seen the same when using this approach to measure performance, for example this post claims wasm is x84 faster, which is clearly wrong!
  3. The current WebAssembly VM is very new, and an MVP. It will get faster. Your JavaScript VM has had 20 years to reach its current speed. The performance of the JS <=> wasm boundary is being worked on and optimised right now.

For a more definitive answer, see the joint paper from the WebAssembly team, which outlines an expected runtime performance gain of around 30%

Finally, to answer your point:

What's the point of WebAssembly if it does not optimise

I think you have misconceptions around what WebAssembly will do for you. Based on the paper above, the runtime performance optimisations are quite modest. However, there are still a number of performance advantages:

  1. Its compact binary format and low-level nature mean the browser can load, parse and compile the code much faster than JavaScript. It is anticipated that WebAssembly can be compiled faster than your browser can download it.
  2. WebAssembly has a predictable runtime performance. With JavaScript the performance generally increases with each iteration as it is further optimised. It can also decrease due to de-optimisation.
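The warm-up effect described in point 2 can be made visible by timing the same loop for several rounds in a row. A rough sketch, assuming a hypothetical helper `timeRounds`; nothing here guarantees when (or whether) a given engine tiers up, so the per-round numbers are only indicative.

```javascript
// Plain JS length function, as in the question.
function len(x, y, x1, y1) {
    const nx = x1 - x;
    const ny = y1 - y;
    return Math.sqrt(nx * nx + ny * ny);
}

// Assumption: fall back to Date if `performance` is not available.
const clock = typeof performance !== "undefined" ? performance : Date;

// Runs the same workload `rounds` times and records the time of each round
// in milliseconds. `sink` is returned so the calls cannot be eliminated.
function timeRounds(fn, data, rounds) {
    const times = [];
    let sink = 0;
    for (let r = 0; r < rounds; r++) {
        const start = clock.now();
        for (let i = 0; i + 3 < data.length; i++) {
            sink += fn(data[i], data[i + 1], data[i + 2], data[i + 3]);
        }
        times.push(clock.now() - start);
    }
    return { times, sink };
}

const data = Array.from({ length: 10003 }, () => Math.random() * 200000 - 100000);
const { times } = timeRounds(len, data, 10);
times.forEach((ms, round) => console.log("round " + round + ": " + ms.toFixed(3) + " ms"));
```

If the first rounds are noticeably slower than the later ones, that difference is the optimisation ramp-up that a single averaged figure hides.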

There are also a number of non-performance-related advantages.

For a more realistic performance measurement, take a look at:

Both are practical, production codebases.

Upvotes: 27

Andreas Rossberg

Reputation: 36088

The JS engine can apply a lot of dynamic optimisations to this example:

  1. Perform all calculations with integers and only convert to double for the final call to Math.sqrt.

  2. Inline the call to the len function.

  3. Hoist the computation out of the loop, since it always computes the same thing.

  4. Recognise that the loop is left empty and eliminate it entirely.

  5. Recognise that the result is never returned from the testing function, and hence remove the entire body of the test function.

All but (4) apply even if you add the result of every call. With (5) the end result is an empty function either way.

With Wasm an engine cannot do most of these steps, because it cannot inline across language boundaries (at least no engine does that today, AFAICT). Also, for Wasm it is assumed that the producing (offline) compiler has already performed relevant optimisations, so a Wasm JIT tends to be less aggressive than one for JavaScript, where static optimisation is impossible.
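A minimal sketch of a test loop that resists steps 3 to 5, assuming the goal is to time the calls themselves: the inputs change on every iteration, and the accumulated result is returned and printed so it stays observable. `sumLengths` is a hypothetical helper name, not something from the question, and an engine is still free to inline `len` (step 2).

```javascript
// Plain JS length function, as in the question.
function len(x, y, x1, y1) {
    const nx = x1 - x;
    const ny = y1 - y;
    return Math.sqrt(nx * nx + ny * ny);
}

// Calls `fn` on sliding windows of four values and returns the total.
// Because every call sees different arguments, the computation cannot be
// hoisted out of the loop, and because the total is returned, the loop
// body cannot be dropped as dead code.
function sumLengths(fn, data) {
    let total = 0;
    for (let i = 0; i + 3 < data.length; i++) {
        total += fn(data[i], data[i + 1], data[i + 2], data[i + 3]);
    }
    return total;
}

// Non-integer inputs, so the engine cannot specialise on integer arithmetic.
const data = Array.from({ length: 1003 }, () => Math.random() * 200000 - 100000);
const checksum = sumLengths(len, data);
console.log("checksum: " + checksum);
```

Passing an exported Wasm function as `fn` exercises the same loop, which keeps the comparison between the two implementations like for like.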

Upvotes: 7
