Reputation: 111
I am new to the AVX2 and SSE2 instruction sets, and I want to learn more about how to use them to speed up bit vector operations.
So far I have used them successfully to vectorize code with double / float operations.
In this example, I have C++ code that checks a condition and, depending on the result, sets a bit in a bit vector (stored as unsigned int words):
int process_bit_vector(unsigned int *bitVector, float *value, const float threshold, const unsigned int dim)
{
    int sum = 0, cond = 0;
    for (unsigned int i = 0; i < dim; i++) {
        unsigned int *word = bitVector + i / 32;
        unsigned int bitValue = ((unsigned int)0x80000000 >> (i & 0x1f));
        cond = (value[i] <= threshold);
        (*word) = (cond) ? (*word) | bitValue : (*word);
        sum += cond;
    }
    return sum;
}
The variable sum just returns the number of cases where the condition is TRUE.
I tried to rewrite this routine with SSE2 and AVX2 but it didn't work out... :-(
Is it possible to rewrite this C++ code using AVX2 and SSE2? Is it worth using vectorization for this type of bit operation? The bit vector could contain many thousands of bits, so I hope SSE2 and AVX2 can provide a worthwhile speed-up.
Thanks in advance!
Upvotes: 3
Views: 433
Reputation: 18827
The following should work if dim is a multiple of 8 (to handle the remainder, add a trivial loop at the end). Minor API changes:

- long instead of unsigned int for loop indices (this helps clang unroll the loop)
- bitVector is little-endian (as suggested in the comments)

Inside the loop, bitVector is accessed byte-wise. It might be worth combining 2 or 4 results of movemask and bit-or-ing them at once (this probably depends on the target architecture; see the sketch after the code below).

To calculate the sum, 8 partial sums are calculated directly from the result of the cmp_ps operation. Since you need the bitmask anyway, it may be worth using popcnt instead (ideally after combining 2, 4, or 8 bytes together -- again, this probably depends on your target architecture).
#include <immintrin.h>
#include <stdint.h>

int process_bit_vector(uint32_t *bitVector32, float *value,
                       const float threshold_float, const long dim) {
    __m256i sum = _mm256_setzero_si256();
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);
    uint8_t *bitVector8 = (uint8_t *)bitVector32;

    for (long i = 0; i <= dim - 8; i += 8) {
        // compare next 8 values with threshold
        // (use threshold as first operand to allow loading other operand from memory)
        __m256 cmp_mask = _mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i), _CMP_GE_OQ);
        // true values are `-1` when interpreted as integers, subtract those from `sum`
        sum = _mm256_sub_epi32(sum, _mm256_castps_si256(cmp_mask));
        // extract bitmask
        int mask = _mm256_movemask_ps(cmp_mask);
        // bitwise-or current mask with result bit-vector
        *bitVector8++ |= mask;
    }

    // reduce 8 partial sums to a single sum and return
    __m128i sum_reduced = _mm_add_epi32(_mm256_castsi256_si128(sum), _mm256_extracti128_si256(sum, 1));
    sum_reduced = _mm_add_epi32(sum_reduced, _mm_srli_si128(sum_reduced, 8));
    sum_reduced = _mm_add_epi32(sum_reduced, _mm_srli_si128(sum_reduced, 4));
    return _mm_cvtsi128_si32(sum_reduced);
}
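To make the "combine 2 or 4 movemask results and use popcnt" remark above concrete, here is a rough, untested little-endian sketch. The function name process_bit_vector_u32 is made up for illustration; it assumes the same value[i] <= threshold condition and x86's little-endian byte order inside each uint32_t, and it handles the dim % 32 remainder with a scalar tail loop:

#include <immintrin.h>
#include <stdint.h>

// Sketch: combine four movemask_ps results into one 32-bit word per iteration,
// count the set bits with popcnt, and bit-or the word into the little-endian bit vector.
int process_bit_vector_u32(uint32_t *bitVector32, float *value,
                           const float threshold_float, const long dim) {
    int sum = 0;
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);
    long i = 0;
    for (; i <= dim - 32; i += 32) {
        // each movemask_ps yields 8 bits; bit j is set iff value[i + offset + j] <= threshold
        uint32_t m0 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  0), _CMP_GE_OQ));
        uint32_t m1 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  8), _CMP_GE_OQ));
        uint32_t m2 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 16), _CMP_GE_OQ));
        uint32_t m3 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 24), _CMP_GE_OQ));
        // merge into one 32-bit mask: bit k corresponds to value[i + k]
        uint32_t mask = m0 | (m1 << 8) | (m2 << 16) | (m3 << 24);
        sum += _mm_popcnt_u32(mask);
        *bitVector32++ |= mask;
    }
    // scalar tail for the remaining dim % 32 elements (same little-endian bit order)
    for (; i < dim; ++i) {
        int cond = (value[i] <= threshold_float);
        if (cond) *bitVector32 |= 1u << (i & 31);
        sum += cond;
    }
    return sum;
}

Whether this beats the byte-wise version depends on the target: the shifts and ORs that merge the four masks compete with the rest of the loop for scalar ALU ports.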
Godbolt-Link: https://godbolt.org/z/ABwDPe

A few notes on the generated assembly: one compiler produces vpsubd ymm2, ymm0, ymm1; vmovdqa ymm0, ymm2; instead of just vpsubd ymm0, ymm0, ymm1, while another fuses the load with the vcmpps (and uses an LE instead of a GE comparison). If you don't care about how NaNs are handled, you could use _CMP_NLT_US instead of _CMP_GE_OQ.

Revised version with big-endian output (untested):
int process_bit_vector(uint32_t *bitVector32, float *value,
                       const float threshold_float, const long dim) {
    int sum = 0;
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);

    for (long i = 0; i <= dim - 32; i += 32) {
        // compare next 4x8 values with threshold
        // (use threshold as first operand to allow loading other operand from memory)
        __m256i cmp_maskA = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  0), _CMP_GE_OQ));
        __m256i cmp_maskB = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  8), _CMP_GE_OQ));
        __m256i cmp_maskC = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 16), _CMP_GE_OQ));
        __m256i cmp_maskD = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 24), _CMP_GE_OQ));

        __m256i cmp_mask = _mm256_packs_epi16(
            _mm256_packs_epi16(cmp_maskA, cmp_maskB), // b7b7b6b6'b5b5b4b4'a7a7a6a6'a5a5a4a4 b3b3b2b2'b1b1b0b0'a3a3a2a2'a1a1a0a0
            _mm256_packs_epi16(cmp_maskC, cmp_maskD)  // d7d7d6d6'd5d5d4d4'c7c7c6c6'c5c5c4c4 d3d3d2d2'd1d1d0d0'c3c3c2c2'c1c1c0c0
        ); // cmp_mask = d7d6d5d4'c7c6c5c4'b7b6b5b4'a7a6a5a4 d3d2d1d0'c3c2c1c0'b3b2b1b0'a3a2a1a0
        cmp_mask = _mm256_permute4x64_epi64(cmp_mask, 0x8d);
        // cmp_mask = [b7b6b5b4'a7a6a5a4 b3b2b1b0'a3a2a1a0 d7d6d5d4'c7c6c5c4 d3d2d1d0'c3c2c1c0]
        __m256i shuff_idx = _mm256_broadcastsi128_si256(_mm_set_epi64x(0x00010203'08090a0b, 0x04050607'0c0d0e0f));
        cmp_mask = _mm256_shuffle_epi8(cmp_mask, shuff_idx);

        // extract bitmask
        uint32_t mask = _mm256_movemask_epi8(cmp_mask);
        sum += _mm_popcnt_u32(mask);
        // bitwise-or current mask with result bit-vector
        *bitVector32++ |= mask;
    }
    return sum;
}
The idea is to shuffle the bytes before applying a vpmovmskb on them. This takes 5 shuffle operations (including the 3 vpacksswb) for 32 input values, but computation of the sum is done using a popcnt instead of 4 vpsubd. The vpermq (_mm256_permute4x64_epi64) could probably be avoided by strategically loading 128-bit halves into the 256-bit vectors before comparing them. Another idea (since you need to shuffle the final result anyway) would be to blend together partial results (this tends to require p5 or 2*p015 on the architectures I checked, so it is probably not worth it).
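To illustrate the "strategically loading 128-bit halves" remark: if the little-endian output of the first version is acceptable, loading each compare input as two split 128-bit halves makes the in-lane vpacksswb emit the bytes already in order, so neither the vpermq nor the vpshufb is needed (the vpshufb above only exists to produce big-endian output). This is a rough, untested sketch; load_split_halves and process_bit_vector_split are made-up names, and the remainder loop is omitted as above:

#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: low 128 bits loaded from `lo`, high 128 bits from `hi`.
static inline __m256 load_split_halves(const float *lo, const float *hi) {
    return _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(lo)), _mm_loadu_ps(hi), 1);
}

// Little-endian variant of the 32-elements-per-iteration loop without vpermq/vpshufb.
int process_bit_vector_split(uint32_t *bitVector32, float *value,
                             const float threshold_float, const long dim) {
    int sum = 0;
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);
    for (long i = 0; i <= dim - 32; i += 32) {
        // the low 128-bit halves of A..D together cover values i+0..i+15 and the high halves
        // cover i+16..i+31, so the in-lane packs below emit bytes for v0..v15 in lane 0
        // and v16..v31 in lane 1, already in order
        __m256i cmp_maskA = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i +  0, value + i + 16), _CMP_GE_OQ));
        __m256i cmp_maskB = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i +  4, value + i + 20), _CMP_GE_OQ));
        __m256i cmp_maskC = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i +  8, value + i + 24), _CMP_GE_OQ));
        __m256i cmp_maskD = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i + 12, value + i + 28), _CMP_GE_OQ));

        __m256i cmp_mask = _mm256_packs_epi16(
            _mm256_packs_epi16(cmp_maskA, cmp_maskB),
            _mm256_packs_epi16(cmp_maskC, cmp_maskD));

        uint32_t mask = _mm256_movemask_epi8(cmp_mask);  // bit k == (value[i + k] <= threshold)
        sum += _mm_popcnt_u32(mask);
        *bitVector32++ |= mask;
    }
    return sum;
}

The trade-off is four extra 128-bit load/insert operations per iteration in place of the two cross-lane shuffles, so whether this helps depends on whether the loop is shuffle-bound or load-bound on the target.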
Upvotes: 2