Reputation: 7813
I have a bit-mask of N chars in size, which is statically known (i.e. can be calculated at compile time, but it's not a single constant, so I can't just write it down), with bits set to 1 denoting the "wanted" bits. And I have a value of the same size, which is only known at runtime. I want to collect the "wanted" bits from that value, in order, into the beginning of a new value. For simplicity's sake let's assume the number of wanted bits is <= 32.
Completely unoptimized reference code which hopefully has the correct behaviour:
template<int N, const char mask[N]>
unsigned gather_bits(const char* val)
{
unsigned result = 0;
char* result_p = (char*)&result;
int pos = 0;
for (int i = 0; i < N * CHAR_BIT; i++)
{
if (mask[i/CHAR_BIT] & (1 << (i % CHAR_BIT)))
{
if (val[i/CHAR_BIT] & (1 << (i % CHAR_BIT)))
{
if (pos < sizeof(unsigned) * CHAR_BIT)
{
result_p[pos/CHAR_BIT] |= 1 << (pos % CHAR_BIT);
}
else
{
abort();
}
}
pos += 1;
}
}
return result;
}
Although I'm not sure whether that formulation actually allows access to the contents of the mask at compile time. But in any case, it's available for use, maybe a constexpr
function or something would be a better idea. I'm not looking here for the necessary C++ wizardry (I'll figure that out), just the algorithm.
An example of input/output, with 16-bit values and imaginary binary notation for clarity:
mask = 0b0011011100100110
val = 0b0101000101110011
--
wanted = 0b__01_001__1__01_ // retain only those bits which are set in the mask
result = 0b0000000001001101 // bring them to the front
^ gathered bits begin here
My questions are:
What's the most performant way to do this? (Are there any hardware instructions that can help?)
What if both the mask and the value are restricted to be unsigned
, so a single word, instead of an unbounded char array? Can it then be done with a fixed, short sequence of instructions?
Upvotes: 4
Views: 307
Reputation: 64913
There will pext
(parallel bit extract) that does exactly what you want in Intel Haswell. I don't know what the performance of that instruction will be, probably better than the alternatives though. This operation is also known as "compress-right" or simply "compress", the implementation from Hacker's Delight is this:
unsigned compress(unsigned x, unsigned m) {
unsigned mk, mp, mv, t;
int i;
x = x & m; // Clear irrelevant bits.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel prefix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
m = m ^ mv | (mv >> (1 << i)); // Compress m.
t = x & mv;
x = x ^ t | (t >> (1 << i)); // Compress x.
mk = mk & ~mp;
}
return x;
}
Upvotes: 4