senseiwa

Reputation: 2499

Optimize blockwise bit operations: base-4 numbers

This should be a fun question, at least for me.

My intent is to manipulate base-4 numbers, encoded in an unsigned integer. Each two-bit block then represents a single base-4 digit, starting from the least significant bit:

01 00 11 = base4(301)
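For instance, reading a digit back out of this encoding is just a shift and a mask (a tiny scalar sketch; the helper name is mine):

```cpp
#include <cstdint>

// Digit i of v sits in the two bits at position 2*i.
uint32_t digit(uint32_t v, int i)
{
    return (v >> (2 * i)) & 3u;
}
```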

I'd like to optimize my code using SSE instructions, because I'm not sure how well I scored here; maybe poorly.

The code starts from strings (and uses them to check correctness), and implements conversion to binary, conversion back to a string, and digit reversal:

Any hints are more than welcome!

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

uint32_t tobin(const std::string& s)
{
    uint32_t v, bin = 0;

    // Convert to binary
    for (std::size_t i = 0; i < s.size(); i++)
    {
        switch (s[i])
        {
            case '0':
                v = 0;
                break;

            case '3':
                v = 3;
                break;

            case '1':
                v = 1;
                break;

            case '2':
                v = 2;
                break;

            default:
                throw "UNKNOWN!";
        }

        bin = bin | (v << (i << 1));
    }

    return bin;
}
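Since '0'..'3' are consecutive in ASCII, the switch above can collapse into a subtraction plus a single range check (a hedged scalar sketch; the name tobin_fast is mine):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Branch-light variant of tobin: '0'..'3' are consecutive in ASCII,
// so the digit value is just s[i] - '0'; anything outside 0..3
// (including characters below '0', which wrap around) is rejected.
uint32_t tobin_fast(const std::string& s)
{
    uint32_t bin = 0;

    for (std::size_t i = 0; i < s.size(); i++)
    {
        uint32_t v = static_cast<uint32_t>(s[i]) - '0';
        if (v > 3)
            throw std::invalid_argument("not a base-4 digit");
        bin |= v << (i * 2);
    }
    return bin;
}
```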

std::string tostr(int size, const uint32_t v)
{
    std::string b;

    // Extract each base-4 digit and append it as a character
    for (int i = 0; i < size; i++)
    {
        unsigned char c = static_cast<unsigned char>((v >> (i << 1)) & 3u);

        switch (c)
        {
            case 0:
                b += '0';
                break;

            case 3:
                b += '3';
                break;

            case 1:
                b += '1';
                break;

            case 2:
                b += '2';
                break;

            default:
                throw "UNKNOWN!";
        }
    }

    return b;
}

uint32_t revrs(int size, const uint32_t v)
{
    uint32_t bin = 0;

    // Reverse the order of the base-4 digits
    for (int i = 0; i < size; i++)
    {
        uint32_t q = (v >> (i << 1)) & 3u;

        bin = bin | (q << ((size - i - 1) << 1));
    }

    return bin;
}
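As a hint toward the "blockwise" optimization the question asks about: the per-digit loop in revrs can be replaced by three scalar swap steps that reverse all 16 digits at once, followed by one shift to realign shorter numbers (my sketch, name revrs_fast; assumes 1 <= size <= 16):

```cpp
#include <cstdint>

// Reverse all 16 two-bit digits of a 32-bit word: swap adjacent
// digits, then nibbles, then bytes; finally drop the unused digits.
// Requires 1 <= size <= 16 (size == 0 would shift by 32, which is UB).
uint32_t revrs_fast(int size, uint32_t v)
{
    v = ((v & 0x33333333u) << 2) | ((v >> 2) & 0x33333333u); // swap adjacent digits
    v = ((v & 0x0F0F0F0Fu) << 4) | ((v >> 4) & 0x0F0F0F0Fu); // swap nibbles in bytes
    v = (v >> 24) | ((v >> 8) & 0x0000FF00u)                 // swap bytes
      | ((v << 8) & 0x00FF0000u) | (v << 24);
    return v >> (32 - 2 * size);                             // realign short numbers
}
```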

bool ckrev(std::string s1, std::string s2)
{
    std::reverse(s1.begin(), s1.end());

    return s1 == s2;
}

int main(int argc, char* argv[])
{
    // Binary representation of base-4 number
    uint32_t binr;

    std::vector<std::string> chk { "123", "2230131" };

    for (const auto &s : chk)
    {
        std::string b, r;
        uint32_t    c;

        binr = tobin(s);
        b    = tostr(s.size(), binr);
        c    = revrs(s.size(), binr);
        r    = tostr(s.size(), c);

        std::cout << "orig " << s << std::endl;
        std::cout << "binr " << std::hex << binr << " string " << b << std::endl;
        std::cout << "revs " << std::hex << c    << " string " << r << std::endl;
        std::cout << ">>> CHK  " << (s == b) << " " << ckrev(r, b) << std::endl;
    }

    return 0;
}

Upvotes: 1

Views: 394

Answers (2)

stgatilov

Reputation: 5533

I'll solve the problem of converting a 32-bit integer to a base-4 string with SSE. The problem of removing leading zeros is not considered, i.e. the base-4 strings always have length 16.

General thoughts

Clearly, we have to extract pairs of bits in vectorized form. To do that, we can perform some byte manipulations and bitwise operations. Let's see what we can do with SSE:

  1. A single intrinsic _mm_shuffle_epi8 (from SSSE3) allows you to shuffle 16 bytes in absolutely any way you desire. Some well-structured shuffles and register mixtures can be done with simpler SSE2 instructions, but it's important to remember that any in-register byte shuffle can be done with one cheap instruction.

  2. Shuffling does not help to change the indices of bits within a byte. To move chunks of bits around, we usually use bit shifts. Unfortunately, there is no way in SSE to shift different elements of an XMM register by different amounts. As @PeterCordes mentioned in the comments, such instructions exist in AVX2 (e.g. _mm_sllv_epi32), but they operate on at least 32-bit granularity.

From ancient times we have been taught that bit shifts are fast and multiplication is slow. Today arithmetic is so much accelerated that this is no longer true: in SSE, shifts and multiplications seem to have equal throughput, although multiplications have higher latency.

  3. Using multiplication by powers of two, we can shift left different elements of a single XMM register by different amounts. There are instructions like _mm_mulhi_epi16 which allow 16-bit granularity, and the instruction _mm_maddubs_epi16 allows 8-bit granularity of shifts. A right shift can be done via a left shift just the same way people do division via multiplication: shift left by 16-k, then shift right by two bytes (recall that any byte shuffling is cheap).

We actually want to do 16 different bit shifts. If we use multiplication with 16-bit granularity, then we'll have to use at least two XMM registers for the shifting, which can be merged together afterwards. Alternatively, we can try multiplication with 8-bit granularity to do everything in a single register.
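The multiply-as-shift trick can be sanity-checked in scalar code. The helper below (name mine) models one lane of _mm_mulhi_epu16: keeping the high 16 bits of a 16x16-bit product, so multiplying by 1 << k amounts to a right shift by 16 - k:

```cpp
#include <cstdint>

// One lane of _mm_mulhi_epu16: the high 16 bits of the 32-bit product.
// With b = 1 << k this computes a >> (16 - k), which is how a single
// multiply applies a different shift amount to every lane.
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return static_cast<uint16_t>((static_cast<uint32_t>(a) * b) >> 16);
}
```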

16-bit granularity

First of all, we have to move the 32-bit integer into the lower 4 bytes of an XMM register. Then we shuffle bytes so that each 16-bit element of the register contains one byte of the input:

|abcd|0000|0000|0000|   before shuffle (little-endian)
|a0a0|b0b0|c0c0|d0d0|   after shuffle (to low halves)
|0a0a|0b0b|0c0c|0d0d|   after shuffle (to high halves)

Then we can call _mm_mulhi_epi16 to shift each element right by k = 1..16. Actually, it is more convenient to put the input bytes into the high halves of the 16-bit elements, so that we can shift left by k = -8..7. As a result, we want some bytes of the XMM register to contain the pairs of bits defining base-4 digits (as their low bits). After that we can remove the unnecessary high bits with _mm_and_si128 and shuffle the valuable bytes into their proper places.

Since only 8 shifts can be done at once with 16-bit granularity, we have to do the shifting part twice. Then we combine the two XMM registers into one.

Below you can see the code using this idea. It is slightly optimized: there is no byte shuffling after the bit shifts.

__m128i reg = _mm_cvtsi32_si128(val);
__m128i bytes = _mm_shuffle_epi8(reg, _mm_setr_epi8(-1, 0, -1, 0, -1, 1, -1, 1, -1, 2, -1, 2, -1, 3, -1, 3));
__m128i even = _mm_mulhi_epu16(bytes, _mm_set1_epi32(0x00100100));  //epi16:  1<<8,  1<<4  x4 times
__m128i odd  = _mm_mulhi_epu16(bytes, _mm_set1_epi32(0x04004000));  //epi16: 1<<14, 1<<10  x4 times
even = _mm_and_si128(even, _mm_set1_epi16(0x0003));
odd  = _mm_and_si128(odd , _mm_set1_epi16(0x0300));
__m128i res = _mm_xor_si128(even, odd);
res = _mm_add_epi8(res, _mm_set1_epi8('0'));
_mm_storeu_si128((__m128i*)s, res);
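To validate a kernel like the one above, a scalar reference that produces the same 16 digit characters may help (a sketch; the name base4_ref is mine):

```cpp
#include <cstdint>

// Scalar reference for the SSE kernels: digit i of val (the two bits
// at position 2*i) becomes the character s[i], one of '0'..'3',
// for all 16 digit positions.
void base4_ref(uint32_t val, char s[16])
{
    for (int i = 0; i < 16; i++)
        s[i] = static_cast<char>('0' + ((val >> (2 * i)) & 3u));
}
```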

8-bit granularity

First of all we move our 32-bit integer into an XMM register, of course. Then we shuffle bytes so that each byte of the result equals the input byte containing the two bits wanted at that place:

|abcd|0000|0000|0000|   before shuffle (little-endian)
|aaaa|bbbb|cccc|dddd|   after shuffle

Now we use _mm_and_si128 to filter the bits: in each byte only the two wanted bits must remain. After that we only need to shift each byte right by 0/2/4/6 bits. This could be achieved with the intrinsic _mm_maddubs_epi16, which can shift 16 bytes at once. Unfortunately, I do not see how to shift all the bytes properly with this instruction alone, but at least we can shift each odd byte right by 2 bits (even bytes remain as they are). Then the bytes with indices 4k+2 and 4k+3 can be shifted right by 4 bits with a single _mm_madd_epi16 instruction.

Here is the resulting code:

__m128i reg = _mm_cvtsi32_si128(val);
__m128i bytes = _mm_shuffle_epi8(reg, _mm_setr_epi8(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3));
__m128i twobits = _mm_and_si128(bytes, _mm_set1_epi32(0xC0300C03));         //epi8: 3<<0, 3<<2, 3<<4, 3<<6  x4 times
twobits = _mm_maddubs_epi16(twobits, _mm_set1_epi16(0x4001));               //epi8: 1<<0, 1<<6  x8 times
__m128i res = _mm_madd_epi16(twobits, _mm_set1_epi32(0x10000001));          //epi16: 1<<0, 1<<12  x4 times
res = _mm_add_epi8(res, _mm_set1_epi8('0'));
_mm_storeu_si128((__m128i*)s, res);
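The _mm_maddubs_epi16 step is the subtle part. A scalar model of one 16-bit lane (helper name mine) shows why the multiplier pair (1, 1 << 6) leaves the even byte alone while landing the odd byte in the high byte shifted right by 2:

```cpp
#include <cstdint>

// One 16-bit lane of _mm_maddubs_epi16: multiplies unsigned bytes
// a0, a1 by signed bytes b0, b1 and adds the products with signed
// saturation. With (b0, b1) = (1, 64): result = a0 + (a1 << 6), so
// the high byte of the lane equals a1 >> 2.
int16_t maddubs_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1)
{
    int32_t r = a0 * b0 + a1 * b1;
    if (r >  32767) r =  32767;   // saturate to int16 range
    if (r < -32768) r = -32768;
    return static_cast<int16_t>(r);
}
```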

P.S.

Both solutions use a lot of compile-time-constant 128-bit values. They are not encoded into x86 instructions, so the processor has to load them from memory (most likely L1 cache) each time they are used. However, if you run many conversions in a loop, the compiler should load all these constants into registers before the loop (I hope).

Here you can find the full code (without timing), including implementation of the str2bin solution by @YvesDaoust.

Upvotes: 2

user1196549

Reputation:

This is a little challenging with SSE because there is little provision for bit packing (you want to take two bits from every character and pack them contiguously). Anyway, the special instruction _mm_movemask_epi8 can help you.

For the string-to-binary conversion, you can proceed as follows:

  • load the 16 characters string (pad with zeroes or clear after the load if necessary);

  • subtract bytewise an ASCII '0';

  • compare bytewise 'unsigned greater than' against a string of 16 '3' bytes; this will set a byte to 0xFF wherever there is an invalid character;

  • use _mm_movemask_epi8 to detect such a character in the packed short value

If all is fine, you now need to pack the bit pairs. For this you need to

  • duplicate the 16 bytes

  • shift the bits of weight 1 and 2, left by 7 or 6 positions, to make them most significant (_mm_sll_epi16. There is no epi8 version, but bits from one element becoming garbage in the low bits of another element isn't important for this.)

  • interleave them (_mm_unpack..._epi8, once with lo and once with hi)

  • store the high bits of those two vectors into shorts with _mm_movemask_epi8.
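The steps above can be sketched with SSE2 intrinsics only (the function name is mine; it assumes the input holds exactly 16 readable characters, digit 0 ending up in the lowest two bits):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Pack 16 base-4 characters into a 32-bit value; false on bad input.
bool str16_to_base4(const char* s, uint32_t* out)
{
    __m128i chars  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    __m128i digits = _mm_sub_epi8(chars, _mm_set1_epi8('0'));

    // unsigned "digit <= 3" test: max(digit, 3) == 3 iff the byte is valid
    // (characters below '0' wrap around to large values and are caught too)
    __m128i ok = _mm_cmpeq_epi8(_mm_max_epu8(digits, _mm_set1_epi8(3)),
                                _mm_set1_epi8(3));
    if (_mm_movemask_epi8(ok) != 0xFFFF)
        return false;

    // move bit 0 (resp. bit 1) of every digit into its byte's MSB;
    // bits crossing from an even byte into the odd byte's low bits
    // never reach an MSB that movemask reads
    __m128i b0 = _mm_slli_epi16(digits, 7);
    __m128i b1 = _mm_slli_epi16(digits, 6);

    // interleave so each movemask reads bit0,bit1,bit0,bit1,... per digit
    uint32_t lo = static_cast<uint32_t>(
        _mm_movemask_epi8(_mm_unpacklo_epi8(b0, b1)));
    uint32_t hi = static_cast<uint32_t>(
        _mm_movemask_epi8(_mm_unpackhi_epi8(b0, b1)));
    *out = lo | (hi << 16);
    return true;
}
```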

For the binary-to-string conversion, I can't think of an SSE implementation that makes sense, as there is no counterpart of _mm_movemask_epi8 that would allow you to unpack efficiently.

Upvotes: 2
