Reputation: 8662
Is it possible to use a SIMD value without going through memory? The only way I can get it to work is by storing its value to memory and then reading it back from memory. Is this really the only way to interact with SIMD values? Can't they be read from and written to the stack?
This is the only solution I can get to compile. Am I missing something, or is this the only way?
(module
  (import "console" "log" (func $log (param i32 i32 i32 i32)))
  (func $main
    i32.const 0
    v128.const i32x4 1 2 3 4
    v128.store
    i32.const 0
    i32.load
    i32.const 4
    i32.load
    i32.const 8
    i32.load
    i32.const 12
    i32.load
    call $log
  )
  (start $main)
  (memory $memory (export "memory") 1)
)
(Solutions in other languages would also be helpful, as long as they don't need memory to read and write SIMD values.)
I'm new to SIMD so any pointers would be greatly appreciated!
Upvotes: 3
Views: 284
Reputation: 463
I'm far from an expert in Wasm SIMD, but I came up with this attempt to compute population counts (Hamming weights) of two i64 arguments using the i8x16.popcnt instruction:
(module
  (func (export "v128.popcnt") (param i64 i64) (result i32)
    (local $v v128)
    ;; cf. https://godbolt.org/z/GfzM9Y83d
    local.get 0
    i64x2.splat
    local.get 1
    i64x2.replace_lane 1
    i8x16.popcnt
    i16x8.extadd_pairwise_i8x16_u
    i32x4.extadd_pairwise_i16x8_u
    local.tee $v
    i32x4.extract_lane 0
    local.get $v
    i32x4.extract_lane 1
    local.get $v
    i32x4.extract_lane 2
    local.get $v
    i32x4.extract_lane 3
    i32.add
    i32.add
    i32.add))
Plugging that into https://webassembly.github.io/wabt/demo/wat2wasm/ with a test program like
const wasmInstance =
new WebAssembly.Instance(wasmModule, {});
const popcnt = wasmInstance.exports['v128.popcnt'];
const uint64max = 0xFFFF_FFFF_FFFF_FFFFn;
console.log(popcnt(uint64max, uint64max - 1n));
does produce the expected result (in this case, 127: 64 set bits in the first argument plus 63 in the second). This is, as @ovinus-real suggested, a combination of replace_lane to get the data into vector form and extract_lane to get it back out again. No memory required!
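To make the lane arithmetic easier to follow, here is a minimal plain-JavaScript sketch that models what the instruction sequence above computes. It is only a model, not Wasm: a v128 is represented as 16 little-endian bytes, and every function name (i64x2Splat, i64x2ReplaceLane, i8x16Popcnt, extaddPairwise, popcnt128) is a hypothetical helper invented for illustration.

```javascript
// Plain-JS model of the lane operations; a v128 is modeled as 16 little-endian bytes.
// All function names are hypothetical helpers, not a real Wasm or JS API.

// Models i64x2.splat: copy the same 64-bit value into both lanes.
function i64x2Splat(x) {
  const v = new Uint8Array(16);
  const view = new DataView(v.buffer);
  view.setBigUint64(0, BigInt.asUintN(64, x), true);
  view.setBigUint64(8, BigInt.asUintN(64, x), true);
  return v;
}

// Models i64x2.replace_lane: return a copy with one 64-bit lane overwritten.
function i64x2ReplaceLane(v, lane, x) {
  const out = Uint8Array.from(v);
  new DataView(out.buffer).setBigUint64(lane * 8, BigInt.asUintN(64, x), true);
  return out;
}

// Models i8x16.popcnt: count the set bits in each byte independently.
function i8x16Popcnt(v) {
  return v.map((byte) => {
    let n = 0;
    for (let b = byte; b !== 0; b >>= 1) n += b & 1;
    return n;
  });
}

// Models the unsigned extadd_pairwise steps: add adjacent lanes, halving the lane count.
function extaddPairwise(lanes) {
  const out = [];
  for (let i = 0; i < lanes.length; i += 2) out.push(lanes[i] + lanes[i + 1]);
  return out;
}

function popcnt128(lo, hi) {
  const v = i64x2ReplaceLane(i64x2Splat(lo), 1, hi);
  const perByte = Array.from(i8x16Popcnt(v)); // 16 counts, each 0..8
  const per16 = extaddPairwise(perByte);      // 8 counts, each 0..16
  const per32 = extaddPairwise(per16);        // 4 counts, each 0..32
  // Stands in for the four extract_lane and three i32.add instructions.
  return per32[0] + per32[1] + per32[2] + per32[3];
}

const uint64max = 0xFFFF_FFFF_FFFF_FFFFn;
console.log(popcnt128(uint64max, uint64max - 1n)); // 127
```

The model says nothing about performance; it only shows why two pairwise widening adds followed by three scalar adds are enough to sum all sixteen per-byte counts.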
That said, beyond being functional, I can't vouch for the tradeoffs at hand. For example, is it better to use splat and one replace_lane, or a zeroed v128.const (e.g. v128.const i64x2 0 0) and two replace_lane instructions?
As far as further pointers go: running that JS test program in Node with the --print-wasm-code flag produces a listing of the assembly that got generated on my platform, and "warming" it by adding a for (let i = 0; i < 100_000; i++) popcnt(0n, 0n) loop engaged the optimizing compiler to produce another listing. I suppose an expert in x86 SIMD could look at those and do some targeted benchmarks to evaluate the different options. At the least, "how do I make this platform-specific instruction sequence use the platform more efficiently" is a question that has more ready answers than "how do I make this platform-independent instruction sequence work optimally across many disparate platforms".
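For the targeted benchmarks, a rough Node timing harness could look like the sketch below. The helper names warmUp and timeCalls are hypothetical, and the stand-in function is only there so the snippet runs on its own; in practice you would pass wasmInstance.exports['v128.popcnt'] instead. Whether a given iteration count is enough to trigger the optimizing compiler is an assumption that should be checked against the --print-wasm-code output.

```javascript
// Hypothetical helpers: warmUp and timeCalls are illustrative names, not part of Node or V8.

// Call fn repeatedly so the engine has a chance to tier up to its optimizing compiler.
function warmUp(fn, args, iterations = 100_000) {
  for (let i = 0; i < iterations; i++) fn(...args);
}

// Time `iterations` calls and report the average nanoseconds per call.
function timeCalls(fn, args, iterations = 1_000_000) {
  let sink = 0; // accumulate results so the calls are not trivially dead code
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) sink += fn(...args);
  const elapsedNs = process.hrtime.bigint() - start;
  return { nsPerCall: Number(elapsedNs) / iterations, sink };
}

// Stand-in for the Wasm export so this sketch is self-contained.
const standIn = (a, b) => Number((a ^ b) & 1n);

warmUp(standIn, [0n, 0n]);
const { nsPerCall } = timeCalls(standIn, [0xFFn, 0x0Fn]);
console.log(`${nsPerCall.toFixed(1)} ns per call`);
```

Each call measured this way includes the JS-to-Wasm call overhead, so for a function this small the numbers are better suited to comparing variants against each other than to judging the SIMD instructions in isolation.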
Upvotes: 0