Jacko
Jacko

Reputation: 13195

Convert each bit in byte to first bit of each nibble in 32 bit int

I have a byte b. I am looking for the most efficient bit manipulation to convert each bit in b to the first bit of each nibble in a 32 bit int x.

For example, if b = 01010111, then x = 0x10101111

I know I can do a brute force approach:

x = (b&1) | (((b>>1)&1)<<4) | ......

Edit: this for an OpenCL kernel for GPU

Upvotes: 1

Views: 207

Answers (1)

BeeOnRope
BeeOnRope

Reputation: 64955

PDEP

As user harold mentioned in the comments, PDEP is the instruction that just does exactly what you want - but it's only available on x86 (as far as I know), and it has terrible1 performance on the newest AMD chips.

LUT

Barring that, a lookup table of 256 x 4-byte entries seems reasonable - at the cost of 1K of pressure on your cache subsystem. You'll find a lot of smart people advocate against LUTs due to the hidden cost of cache misses - but if this particular operation is in fact "hot" then it may turn out to be the fastest even when factoring in any additional misses.

As with any LUT solution, you should be especially careful to benchmark it not only with micro-benchmarks, but in the full application to evaluate the effect of memory pressure.

You could also consider a compromise split-LUT solution that uses one or two 16-entry LUTs for each nibble of the byte, where the result is calculated something like:

int32 x = high_lut[(b & 0xF0) >> 4] | low_lut[b & 0xF]

This cuts the size of the LUTs down by a factor of between ~11 to 322, since we have much fewer entries and some entries can be 2 bytes rather than 4 bytes.

Bit Manipulation

If you really want a bit manipulation solution, to impress your inlaws or something, you can try something like the following:

  1. Split the byte into nibbles and use multiplication by 0x00001111 (low nibble) and 0x01111000 (high nibble) to splat the low (resp. high) nibble into the low (resp high) half of the 4-byte word, and combine the results with or or add. So if your byte had bits abcd efgh you'll have a word like abcd abcd abcd abcd efgh efgh efgh efgh.
  2. and this result with a mask that picks out the bit that belongs in each nibble (although it usually won't be in the right place). The mask is something like 0x84218421 and the result (in binary) will be something like a000 0b00 00c0 000d e000 0f00 00g0 000h.
  3. Now move the 6 out of 8 bits that aren't in the high bit to the right position using the carry behavior of subtraction, something like: ((x | 0x08880888) - 0x01110111) ^ 0x08880888.

The basic idea in the last step is that you set the high bit of each nibble, and subtract 1 from the nibble. So for example, you have the 0b00 nibble, which becomes 1b00 - 1 - the subtraction carries though all the zeros, and stops at the first one, which is either the high bit (b is zero) or b if it is one. So you effectively set the high bit based on the value of the selected bit. Note that you don't need to do this for a or e since they are already in the right place.

The final xor is needed because the above actually sets the high bit to the opposite value as the selected bit, so we need to flip it.

I didn't try it out, so there are no doubt bugs, but the basic idea should be sound. There is probably various ways to optimize it further, but it's not too bad as is: a couple of multiplications and perhaps a half-dozen bit-operations. On platforms with slow multiplications you can probably find another approach for the first step that uses only 1 multiplication combined with a few more primitive operations, or zero at the cost of several more operations.


1 Fully 18x worse throughput than Intel - evidently AMD opted not to implement the circuit to do PDEP in hardware and instead implement it via a series of more elementary operations.

2 The largest reduction is if you share a single 16-entry LUT for both the high and low nibble, although this requires an additional shift for the result of the high nibble lookup. The smaller reduction, shown in the example, uses two 16-entry LUTs: one 4-byte one for the high nibble, and a 2-byte one for the low nibble, and avoids the shift.

Upvotes: 1

Related Questions