Efficient tiny boolean matrix multiplication

Question

I have some unsigned 16 bit integer s which I'd like to map to an unsigned 32 bit integer r in such a way that each flipped bit in s flips at most one (given) bit in r -- simply a mapping between 0..16 and 0..32 that is. So we can see this as a matrix equation

Ps = r

where P is a 32 x 16 boolean matrix, s is a 16 x 1 boolean vector and r is 32 x 1 boolean vector. I have a gut feeling there exists some super simple hack that I'm missing. Important note: the target machine is a 16 bit mcu!

Here's the best I can do:

static u16 P[32] = someArrayOrWhatever();

u32 FsiPermutationHack(u16 s) {
    u32 r;
    for (u16 i = 0; i < 32; i++)
    {
            r |= ((u32)((P[i] & s) > 0) << i);
    }
    return r;
}

The rationale is this: the i:th bit of r is 1 if and only if (P[i] & s) != 0x0000. I am too stupid to disassemble stuff, but I am guessing this would be like ~100 instructions IF we didn't have to do that stupid u32 cast. But then again, perhaps the compiler auto-splits the loop in two for us in which case it's looking pretty good for us.

Apologies for the tangent, just thought I'd share my attempted solution -- do you have a better one?

John Bollinger · Accepted Answer

Inasmuch as you say,

I am guessing this would be like ~100 instructions IF we didn't have to do that stupid u32 cast. But then again, perhaps the compiler auto-splits the loop in two for us in which case it's looking pretty good for us.

and

I have a gut feeling there exists some super simple hack that I'm missing

, I will interpret you to be asking how to minimize the use of 32-bit arithmetic in this code intended for a 16-bit processor.

You really ought to learn how to disassemble and check the compiled result to see whether the compiler does automatically split the loop as you hypothesize, but supposing that it does not, I don't see why you couldn't do the same manually:

static u16 P[32];  /* value assigned elsewhere */

u32 FsiPermutationHack(u16 s) {
    u16 *P_hi = P + 16;
    u16 r_lo = 0;
    u16 r_hi = 0;

    for (u16 i = 0; i < 16; i++) {
        r_lo |= (P[i] & s) != 0) << i;
        r_hi |= (P_hi[i] & s) != 0) << i;
    }

    return ((u32) r_hi << 16) + r_lo;
}

That supposes u16 and u32 to be unsigned 16-bit and 32-bit (respectively) integers with no padding bits.

Note also that the idea that performing arithmetic with type u16 instead of u32 should be an improvement assumes that type u32 has a higher integer promotion rank than unsigned int. Roughly speaking, that comes down to the implementation's unsigned int being a 16-bit type. That's entirely plausible for an implementation for a 16-bit processor. On a system whose int and unsigned int are instead 32-bit types, however, all narrower integer arithmetic arguments would be promoted to 32 bits anyway.

Update:

As far as the possibility of a better alternative algorithm, I observe that each bit of the result is computed from a different element of array P, that the whole value of each element is used, and that the element size is the same as the target machine's native word size. There seems then no scope for performing fewer 16-bit bitwise AND operations than there are array elements (but see below).

If we accept that each array element must be processed separately, then the provided implementation does a pretty good job of approaching it efficiently:

It performs only 16-bit computations until the time comes to assemble the final result;
It computes both the upper and lower halves of the result in the same loop, thus incurring only 16 iterations' worth of loop overhead instead of 32
It largely removes the extra indexing arithmetic that that would otherwise have required by creating P_hi for accessing the upper half of the array

It would be possible to manually unroll the loop to possibly save a few more cycles, but that's the kind of optimization that you absolutely should rely on your compiler to perform for you.

As far as "bit twiddling hacks", the only scope I see for anything of that nature would be processing adjacent pairs of 16-bit array elements as 32-bit unsigned integers. That would allow performing one 32-bit bitwise AND in place of each two 16-bit ANDs. That would be coupled with two 32-bit comparisons (vs. two 16-bit comparisons in the above code). The 16-bit shift and bitwise OR operations of the above approach could be retained. Aside from that having formally undefined behavior as a result of violating the strict aliasing rule, that would involve 32-bit arithmetic, which presumably is about half as fast as 16-bit arithmetic on your 16-bit machine. Performance is better measured than predicted, but I don't see any reason to expect a significant win from that approach.

Efficient tiny boolean matrix multiplication

Answers (1)

Related Questions