Reputation: 44817
If I have two 32-bit values, X and Y, how can I efficiently interleave their bits into one 64-bit value Z, in the order xyxyxyxy... (Z is a position on a Z-order curve.)
I could iterate over each of the bits in X and Y, setting bits in Z as I go. This seems inefficient.
Is there a shortcut to interleaving the bits from two values into one large value, which takes less than a hundred CPU instructions?
Upvotes: 2
Views: 310
Reputation: 8444
Are you having to do this a lot, as in have you got a lot of X's to interleave with a lot Y's? If not, the below is merely for interest only.
If yes, there is an "orthogonal" approach. Say you have 128 X's to interleave with 128 Y's. Start off with 32 128bit vectors, and put the bits for X[0] in the 0'th bit of the 32 vectors. Put the bits for X[1] in the 1'th bit of the vectors, and so on. Do the same for the Y values into another set of 32 128bit vectors. This is effectively transposing the bits in the values to bit columns in a set of vectors. Then, interleaving them is simply a case of indexing alternately an X vector then a Y vector as you re-build the 64bit values from the two sets of vectors.
Admittedly for mere interleaving, the transposition of the ints to a bit column in the array of vectors may be inefficient. But, if there's more bit manipulation to be done later on then a lot of work can be reduced to re-indexing of vectors in an array. Some SIMD units (like Altivec) have vector instructions that can help with the transpositions.
Bitwise logical operations (e.g. AND) can still be done too (1 bit at a time, but for 128 different values at a time), but there's no speed advantage. However in the already-transposed representation there's no disadvantage either. If one's processing is a sequence of bit manipulations and bitwise operations, the bit manipulations are effectively "free", which is where time can be saved.
Upvotes: 1
Reputation: 1907
For what it's worth, here is a manually vectorised version which does the bit operations on x and y at the same time using SSE intrinsics. The gcc compiler however, is too clever for me and it optimizes nielsen's scalar function more efficiently somehow.
Anyway, this works but is probably only useful to someone who might want to adapt it to work with a wider vector width to interleave more than one pair of uint32_ts at once.
uint64_t interleavesse2(uint32_t x0, uint32_t y0) {
// Works but is almost 50% slower!! than scalar version when compiled with gcc -O3 -msse2
// Materially faster than when scalar version compiled with gcc -O2 -msse2
__m128i mask = _mm_setr_epi32(0xffff, 0xffff, 0xffff, 0xffff);
__m128i z = _mm_setr_epi32(x0, 0, y0, 0);
z = _mm_and_si128(_mm_or_si128(z, _mm_slli_epi64(z, 16)), mask);
mask = _mm_setr_epi32(0xff00ff, 0xff00ff, 0xff00ff, 0xff00ff);
z = _mm_and_si128(_mm_or_si128(z, _mm_slli_epi64(z, 8)), mask);
mask = _mm_setr_epi32(0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f, 0xf0f0f0f);
z = _mm_and_si128(_mm_or_si128(z, _mm_slli_epi64(z, 4)), mask);
mask = _mm_setr_epi32(0x33333333, 0x33333333, 0x33333333, 0x33333333);
z = _mm_and_si128(_mm_or_si128(z, _mm_slli_epi64(z, 2)), mask);
mask = _mm_setr_epi32(0x55555555, 0x55555555, 0x55555555, 0x55555555);
z = _mm_and_si128(_mm_or_si128(z, _mm_slli_epi64(z, 1)), mask);
uint64_t res[2];
memcpy(res, &z, 16);
return res[0] | (res[1] << 1);
}
Upvotes: 0
Reputation: 7794
This C++-answer also applies to C: https://stackoverflow.com/a/39490836/11993121
The answer outlines the principle, but does not write out the complete solution. A working implementation is as follows:
#include <stdint.h>
uint64_t interleave(uint32_t x0, uint32_t y0)
{
static const uint64_t B[] = {0x0000FFFF0000FFFF, 0x00FF00FF00FF00FF, 0x0F0F0F0F0F0F0F0F, 0x3333333333333333, 0x5555555555555555};
static const unsigned S[] = {16, 8, 4, 2, 1};
uint64_t x = x0;
uint64_t y = y0;
for(unsigned i = 0; i < sizeof(B)/sizeof(B[0]); i++)
{
x = (x | (x << S[i])) & B[i];
y = (y | (y << S[i])) & B[i];
}
return x | (y << 1);
}
Example of test:
#include <stdio.h>
void printBinary64(uint64_t x)
{
uint64_t bit = ((uint64_t)1 << 63);
for(unsigned i = 0; i < 64; i++)
{
printf("%c", (x&bit) ? '1' : '0');
bit = bit >> 1;
}
}
void printBinary32(uint32_t x)
{
uint32_t bit = ((uint32_t)1 << 31);
for(unsigned i = 0; i < 32; i++)
{
printf("%c ", (x&bit) ? '1' : '0');
bit = bit >> 1;
}
}
int main(void)
{
uint32_t x = 0x01234567;
uint32_t y = 0xFEDCBA98;
printf(" ");
printBinary32(x);
printf("\n");
printBinary32(y);
printf("\n");
printBinary64(interleave(x,y));
printf("\n");
}
Upvotes: 3