Most efficient small-word-size multiply for processors without a hardware multiplier

Question

I'm hoping to use the CH32V003 (an RV32EC processor) to run ColorChord, which makes extensive use of multiply-adds to perform DFTs. It can operate at very low bit depths, using 16- or even 8-bit multiplies. However, the RV32EC core in the CH32V003 doesn't support the RISC-V multiply extension (M).

I've tried exploring options in Godbolt (see https://godbolt.org/z/zqTEaeecr) to see what the compiler would do in these situations, but it only seems to emit a call to __mulsi3, which performs a naive 32-bit multiply: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/epiphany/mulsi3.c
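For reference, here is a minimal sketch of what I mean by a naive multiply. It is a generic shift-and-add loop, not a copy of the libgcc source, and the name mul32_shift_add is my own. The loop runs once per bit of the multiplier up to its highest set bit, so a full 32-bit operand can cost up to 32 iterations:

```c
#include <stdint.h>

/* Hypothetical helper (not the libgcc implementation):
   generic shift-and-add multiply, result taken modulo 2^32. */
uint32_t mul32_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)
            result += a;   /* add the multiplicand for each set bit of b */
        a <<= 1;           /* a now holds the original a times the next power of two */
        b >>= 1;           /* move on to the next bit of the multiplier */
    }
    return result;
}
```

With 8-bit inputs, b runs out of set bits after at most 8 iterations, which is part of why I suspect a routine specialized for narrow operands could do better.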

What I'm hoping is that there's some ultra-efficient way to do something like a combined multiply-and-shift for different situations.
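To make "combined multiply-and-shift" concrete, here is a rough sketch of the kind of routine I have in mind, assuming a signed 16-bit sample and an unsigned 8-bit coefficient. The function name and the choice of an 8-bit right shift are purely illustrative, and I make no claim that this is the fastest form:

```c
#include <stdint.h>

/* Hypothetical helper: computes (sample * coeff) >> 8 without a hardware
   multiplier. The loop is bounded at 8 iterations because coeff is 8 bits. */
int32_t mul_s16_u8_shr8(int16_t sample, uint8_t coeff)
{
    int32_t addend = sample;   /* holds sample * 2^i on iteration i */
    int32_t acc = 0;
    for (int i = 0; i < 8; i++) {
        if (coeff & 1u)
            acc += addend;
        addend += addend;      /* doubling avoids left-shifting a negative value */
        coeff >>= 1;
    }
    /* Right-shifting a negative value is implementation-defined in C;
       GCC documents it as an arithmetic (sign-extending) shift. */
    return acc >> 8;
}
```

For example, mul_s16_u8_shr8(256, 128) gives 128, i.e. the sample scaled by 128/256. The intermediate values stay well within 32 bits, since the largest magnitude is 32768 * 256.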

Is there a good guide or discussion on performing extremely efficient multiplies for specific combinations of bit width and signedness on architectures that don't have a hardware multiplier?

