Mendy

Reputation: 8662

WebAssembly SIMD without memory

Is it possible to use SIMD values without going through memory? The only way I can get it to work is by storing the value into memory and then reading it back from memory. Is this really the only way to interact with SIMD values? Can't they be read and written on the stack?

This is the only solution I can get to compile. Am I missing something, or is this the only way?

(module
  
  (import "console" "log" (func $log (param i32 i32 i32 i32)))
  (func $main
    
    i32.const 0
    v128.const i32x4 1 2 3 4
    v128.store
    
    i32.const 0
    i32.load
    i32.const 4
    i32.load
    i32.const 8
    i32.load
    i32.const 12
    i32.load
    call $log
  )
  (start $main)
  (memory $memory (export "memory") 1)
)

(Solutions in other languages would also be helpful, as long as they don't need memory to read and write SIMD values.)

I'm new to SIMD so any pointers would be greatly appreciated!

Upvotes: 3

Views: 284

Answers (1)

Seth P

Reputation: 463

I'm far from an expert in Wasm SIMD, but I came up with this attempt to compute the population count (Hamming weight) of two i64 arguments using the i8x16.popcnt instruction:

(module
  (func (export "v128.popcnt") (param i64 i64) (result i32)
      (local $v v128)
      ;; cf. https://godbolt.org/z/GfzM9Y83d
      local.get 0
      i64x2.splat
      local.get 1
      i64x2.replace_lane 1

      i8x16.popcnt
      i16x8.extadd_pairwise_i8x16_u
      i32x4.extadd_pairwise_i16x8_u
     
      local.tee $v
      i32x4.extract_lane 0
      local.get $v
      i32x4.extract_lane 1
      local.get $v
      i32x4.extract_lane 2
      local.get $v
      i32x4.extract_lane 3
        
      i32.add
      i32.add
      i32.add))

Plugging that into https://webassembly.github.io/wabt/demo/wat2wasm/ with a test program like

const wasmInstance =
      new WebAssembly.Instance(wasmModule, {});
const popcnt = wasmInstance.exports['v128.popcnt'];

const uint64max = 0xFFFF_FFFF_FFFF_FFFFn;

console.log(popcnt(uint64max, uint64max - 1n));

does produce the expected result (in this case, 127). This is, as @ovinus-real suggested, a combination of replace_lane to get the data in vector form, and then extract_lane to get it back out again. No memory required!
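As a sanity check on that 127, here is a minimal pure-JavaScript sketch that counts the same bits without any Wasm. The helper names popcnt64 and popcnt128 are hypothetical and not part of the module above. It assumes the two BigInt arguments are treated as unsigned 64-bit values, which is how the bits end up in the two i64 lanes.

```javascript
// Hypothetical reference helper: count the set bits of one 64-bit value.
function popcnt64(x) {
  // Keep only the low 64 bits, mirroring how a BigInt is wrapped into an i64.
  let v = BigInt.asUintN(64, x);
  let count = 0;
  while (v !== 0n) {
    count += Number(v & 1n); // add the lowest bit
    v >>= 1n;                // shift the next bit into place
  }
  return count;
}

// Same shape as the Wasm export: two i64 halves in, total bit count out.
function popcnt128(lo, hi) {
  return popcnt64(lo) + popcnt64(hi);
}

const uint64max = 0xFFFF_FFFF_FFFF_FFFFn;

// 64 set bits in the first argument plus 63 in the second.
console.log(popcnt128(uint64max, uint64max - 1n)); // 127
```

Comparing this against the Wasm export over a handful of inputs is a cheap way to confirm the lane arithmetic is doing what you expect.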

That said, beyond being functional, I can't vouch for the tradeoffs at hand. For example, is it better to use splat and one replace_lane, or to start from an all-zero v128.const and use two replace_lanes?

As far as further pointers go: running that JS test program in node with the --print-wasm-code flag produces a listing of the machine code generated on my platform, and "warming" it by adding a for (let i = 0; i < 100_000; i++) popcnt(0n, 0n) loop engaged the optimizing compiler, which produced a second listing. I suppose an expert in x86 SIMD could look at those listings and run some targeted benchmarks to evaluate the different options. At the least, "how do I make this platform-specific instruction sequence use the platform more efficiently" is a question with more ready answers than "how do I make this platform-independent instruction sequence work optimally across many disparate platforms".
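The warming loop can be wrapped in a small helper so the same code can be pointed at different variants. This is only a sketch: warmUp and standIn are hypothetical names, standIn is a plain JS placeholder for the Wasm export, and whether 100,000 calls is enough to trigger the optimizing tier depends on the engine and version.

```javascript
// Hypothetical helper: call fn many times so the engine may tier it up.
// The results are summed so the calls are not trivially dead code.
function warmUp(fn, iterations = 100_000) {
  let sink = 0;
  for (let i = 0; i < iterations; i++) {
    sink += fn(0n, 0n);
  }
  return sink;
}

// Placeholder with the same (BigInt, BigInt) -> number shape as the export.
// In the real test program you would pass popcnt instead.
const standIn = (a, b) => Number((a ^ b) & 1n);

console.log(warmUp(standIn));
```

Running node with --print-wasm-code after swapping in the real export should then show both the baseline and the optimized listing, as described above.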

Upvotes: 0
