Hovering
Hovering

Reputation: 73

Efficiently removing lower half-byte in a byte array - C++

I have a long byte array and I want to remove the lower nibble (the lower 4 bits) of every byte and move the rest together such that the result occupies half the space as the input.

For example, if my input is 057ABC23, my output should be 07B2.

My current approach looks like this:

// in is unsigned char*
size_t outIdx = 0;
for(size_t i = 0; i < input_length; i += 8) 
{
    in[outIdx++] = (in[i    ] & 0xF0) | (in[i + 1] >> 4);
    in[outIdx++] = (in[i + 2] & 0xF0) | (in[i + 3] >> 4);
    in[outIdx++] = (in[i + 4] & 0xF0) | (in[i + 5] >> 4);
    in[outIdx++] = (in[i + 6] & 0xF0) | (in[i + 7] >> 4);
}

... where I basically process 8 bytes of input in every loop, to illustrate that I can assume input_length to be divisible by 8 (even though it's probably not faster than processing only 2 bytes per loop). The operation is done in-place, overwriting the input array.

Is there a faster way to do this? For example, since I can read in 8 bytes at a time anyway, the operation could be done on 4-byte or 8-byte integers instead of individual bytes, but I cannot think of a way to do that. The compiler doesn't come up with something itself either, as I can see the output code still operates on bytes (-O3 seems to do some loop unrolling, but that's it).

I don't have control over the input, so I cannot store it differently to begin with.

Upvotes: 3

Views: 740

Answers (3)

aqrit
aqrit

Reputation: 1185

All x64-86 (and most x86) cpus have SSE2.

For each 16-bit lane do t = (x & 0x00F0) | (x >> 12). Then use the pack instruction to truncate each 16-bit lane to 8-bits. For example, 0xABCD1234 would become 0x00CA0031 then the pack would make it 0xCA31.

#include <emmintrin.h>

void squish_32bytesTo16 (unsigned char* src, unsigned char* dst) {
    const __m128i mask = _mm_set1_epi16(0x00F0);
    __m128i src0 = _mm_loadu_si128((__m128i*)(void*)src);
    __m128i src1 = _mm_loadu_si128((__m128i*)(void*)(src + sizeof(__m128i)));
    __m128i t0 = _mm_or_si128(_mm_and_si128(src0, mask), _mm_srli_epi16(src0, 12));
    __m128i t1 = _mm_or_si128(_mm_and_si128(src1, mask), _mm_srli_epi16(src1, 12));
    _mm_storeu_si128((__m128i*)(void*)dst, _mm_packus_epi16(t0, t1));
}

Upvotes: 1

Hovering
Hovering

Reputation: 73

Just to put the resulting code here for future reference, it now looks like this (assuming the system is little endian, and the input length is a multiple of 8 bytes):

void compress(unsigned char* in, size_t input_length)
{
    unsigned int* inUInt = reinterpret_cast<unsigned int*>(in);
    unsigned long long* inULong = reinterpret_cast<unsigned long long*>(in);

    for(size_t i = 0; i < input_length / 8; ++i) 
    {
        unsigned long long value = inULong[i] & 0xF0F0F0F0F0F0F0F0;
        value = (value >> 4) | (value << 8);
        value &= 0xFF00FF00FF00FF00;
        value |= (value << 8);
        value &= 0xFFFF0000FFFF0000;
        value |= (value << 16);
        inUInt[i] = static_cast<unsigned int>(value >> 32);
    }
}

Benchmarked very roughly it's around twice as fast as the code in the question (using MSVC19 /O2).

Note that this is basically the solution anatolyg posted before (just put into code), so upvote that answer instead if you found this helpful.

Upvotes: 0

anatolyg
anatolyg

Reputation: 28251

There is a general technique for bit-fiddling to swap bits around. Suppose you have a 64-bit number, containing the following nibbles:

HxGxFxExDxCxBxAx

Here by x I denote a nibble whose value is unimportant (you want to delete it). The result of your bit-operation should be a 32-bit number HGFEDCBA.

First, delete all the x nibbles:

HxGxFxExDxCxBxAx & *_*_*_*_*_*_*_*_ = H_G_F_E_D_C_B_A_

Here I denote 0 by _, and binary 1111 by * for clarity.

Now, replicate your data:

H_G_F_E_D_C_B_A_ << 4 = _G_F_E_D_C_B_A__
H_G_F_E_D_C_B_A_ | _G_F_E_D_C_B_A__ = HGGFFEEDDCCBBAA_

Notice how some of your target nibbles are together. You need to retain these places, and delete duplicate data.

HGGFFEEDDCCBBAA_ & **__**__**__**__ = HG__FE__DC__BA__

From here, you can extract the result bytes directly, or do another iteration or two of the technique.

Next iteration:

HG__FE__DC__BA__ << 8 = __FE__DC__BA____
HG__FE__DC__BA__ | __FE__DC__BA____ = HGFEFEDCDCBABA__
HGFEFEDCDCBABA__ & ****____****____ = HGFE____DCBA____

Last iteration:

HGFE____DCBA____ << 16 = ____DCBA________
HGFE____DCBA____ | ____DCBA________ = HGFEDCBADCBA____
HGFEDCBADCBA____ >> 32 = ________HGFEDCBA

Upvotes: 3

Related Questions