Reputation: 73
I have a long byte array and I want to remove the lower nibble (the lower 4 bits) of every byte and move the rest together such that the result occupies half the space as the input.
For example, if my input is 057ABC23
, my output should be 07B2
.
My current approach looks like this:
// in is unsigned char*
size_t outIdx = 0;
for(size_t i = 0; i < input_length; i += 8)
{
in[outIdx++] = (in[i ] & 0xF0) | (in[i + 1] >> 4);
in[outIdx++] = (in[i + 2] & 0xF0) | (in[i + 3] >> 4);
in[outIdx++] = (in[i + 4] & 0xF0) | (in[i + 5] >> 4);
in[outIdx++] = (in[i + 6] & 0xF0) | (in[i + 7] >> 4);
}
... where I basically process 8 bytes of input in every loop, to illustrate that I can assume input_length
to be divisible by 8 (even though it's probably not faster than processing only 2 bytes per loop). The operation is done in-place, overwriting the input array.
Is there a faster way to do this? For example, since I can read in 8 bytes at a time anyway, the operation could be done on 4-byte or 8-byte integers instead of individual bytes, but I cannot think of a way to do that. The compiler doesn't come up with something itself either, as I can see the output code still operates on bytes (-O3 seems to do some loop unrolling, but that's it).
I don't have control over the input, so I cannot store it differently to begin with.
Upvotes: 3
Views: 740
Reputation: 1185
All x64-86 (and most x86) cpus have SSE2.
For each 16-bit lane do
t = (x & 0x00F0) | (x >> 12)
.
Then use the pack instruction to truncate each 16-bit lane to 8-bits.
For example, 0xABCD1234
would become 0x00CA0031
then the pack would make it 0xCA31
.
#include <emmintrin.h>
void squish_32bytesTo16 (unsigned char* src, unsigned char* dst) {
const __m128i mask = _mm_set1_epi16(0x00F0);
__m128i src0 = _mm_loadu_si128((__m128i*)(void*)src);
__m128i src1 = _mm_loadu_si128((__m128i*)(void*)(src + sizeof(__m128i)));
__m128i t0 = _mm_or_si128(_mm_and_si128(src0, mask), _mm_srli_epi16(src0, 12));
__m128i t1 = _mm_or_si128(_mm_and_si128(src1, mask), _mm_srli_epi16(src1, 12));
_mm_storeu_si128((__m128i*)(void*)dst, _mm_packus_epi16(t0, t1));
}
Upvotes: 1
Reputation: 73
Just to put the resulting code here for future reference, it now looks like this (assuming the system is little endian, and the input length is a multiple of 8 bytes):
void compress(unsigned char* in, size_t input_length)
{
unsigned int* inUInt = reinterpret_cast<unsigned int*>(in);
unsigned long long* inULong = reinterpret_cast<unsigned long long*>(in);
for(size_t i = 0; i < input_length / 8; ++i)
{
unsigned long long value = inULong[i] & 0xF0F0F0F0F0F0F0F0;
value = (value >> 4) | (value << 8);
value &= 0xFF00FF00FF00FF00;
value |= (value << 8);
value &= 0xFFFF0000FFFF0000;
value |= (value << 16);
inUInt[i] = static_cast<unsigned int>(value >> 32);
}
}
Benchmarked very roughly it's around twice as fast as the code in the question (using MSVC19 /O2).
Note that this is basically the solution anatolyg posted before (just put into code), so upvote that answer instead if you found this helpful.
Upvotes: 0
Reputation: 28251
There is a general technique for bit-fiddling to swap bits around. Suppose you have a 64-bit number, containing the following nibbles:
HxGxFxExDxCxBxAx
Here by x
I denote a nibble whose value is unimportant (you want to delete it). The result of your bit-operation should be a 32-bit number HGFEDCBA
.
First, delete all the x
nibbles:
HxGxFxExDxCxBxAx & *_*_*_*_*_*_*_*_ = H_G_F_E_D_C_B_A_
Here I denote 0 by _
, and binary 1111
by *
for clarity.
Now, replicate your data:
H_G_F_E_D_C_B_A_ << 4 = _G_F_E_D_C_B_A__
H_G_F_E_D_C_B_A_ | _G_F_E_D_C_B_A__ = HGGFFEEDDCCBBAA_
Notice how some of your target nibbles are together. You need to retain these places, and delete duplicate data.
HGGFFEEDDCCBBAA_ & **__**__**__**__ = HG__FE__DC__BA__
From here, you can extract the result bytes directly, or do another iteration or two of the technique.
Next iteration:
HG__FE__DC__BA__ << 8 = __FE__DC__BA____
HG__FE__DC__BA__ | __FE__DC__BA____ = HGFEFEDCDCBABA__
HGFEFEDCDCBABA__ & ****____****____ = HGFE____DCBA____
Last iteration:
HGFE____DCBA____ << 16 = ____DCBA________
HGFE____DCBA____ | ____DCBA________ = HGFEDCBADCBA____
HGFEDCBADCBA____ >> 32 = ________HGFEDCBA
Upvotes: 3