Reputation: 111
I am new to the AVX2 and SSE2 instruction sets, and I want to learn more about how to use them to speed up bit vector operations.
So far I have used them successfully to vectorize code with double / float operations.
In this example, I have C++ code that checks a condition and, depending on the result, sets a bit in a bit vector (stored as unsigned int words):
int process_bit_vector(unsigned int *bitVector, float *value, const float threshold, const unsigned int dim)
{
    int sum = 0, cond = 0;
    for (unsigned int i = 0; i < dim; i++) {
        unsigned int *word = bitVector + i / 32;
        unsigned int bitValue = ((unsigned int)0x80000000 >> (i & 0x1f));
        cond = (value[i] <= threshold);
        (*word) = (cond) ? (*word) | bitValue : (*word);
        sum += cond;
    }
    return sum;
}
The variable sum just returns the number of cases where the condition is TRUE.
I tried to rewrite this routine with SSE2 and AVX2 but it didn't work out... :-(
Is it possible to rewrite this C++ code using AVX2 and SSE2? Is it worth using vectorization for this type of bit operation? The bit vector could contain many thousands of bits, so I hope SSE2 and AVX2 can provide a worthwhile speed-up.
Thanks in advance!
Upvotes: 3
Views: 433
Reputation: 18827
The following should work if dim is a multiple of 8 (to handle the remainder, add a trivial loop at the end). Minor API changes:

- long instead of unsigned int for loop indices (this helps clang unroll the loop)
- bitVector is little-endian (as suggested in the comments)

Inside the loop, bitVector is accessed byte-wise. It might be worth combining 2 or 4 results of movemask and bit-or-ing them at once (this probably depends on the target architecture; see the sketch after the code below).

To calculate the sum, 8 partial sums are calculated directly from the result of the cmp_ps operation. Since you need the bitmask anyway, it may be worth using popcnt instead (ideally after combining 2, 4, or 8 bytes together -- again, this probably depends on your target architecture).
#include <immintrin.h>
#include <stdint.h>

int process_bit_vector(uint32_t *bitVector32, float *value,
                       const float threshold_float, const long dim) {
    __m256i sum = _mm256_setzero_si256();
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);
    uint8_t *bitVector8 = (uint8_t *)bitVector32;

    for (long i = 0; i <= dim - 8; i += 8) {
        // compare next 8 values with threshold
        // (use threshold as first operand to allow loading other operand from memory)
        __m256 cmp_mask = _mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i), _CMP_GE_OQ);
        // true values are `-1` when interpreted as integers, subtract those from `sum`
        sum = _mm256_sub_epi32(sum, _mm256_castps_si256(cmp_mask));
        // extract bitmask
        int mask = _mm256_movemask_ps(cmp_mask);
        // bitwise-or current mask with result bit-vector
        *bitVector8++ |= mask;
    }

    // reduce 8 partial sums to a single sum and return
    __m128i sum_reduced = _mm_add_epi32(_mm256_castsi256_si128(sum), _mm256_extracti128_si256(sum, 1));
    sum_reduced = _mm_add_epi32(sum_reduced, _mm_srli_si128(sum_reduced, 8));
    sum_reduced = _mm_add_epi32(sum_reduced, _mm_srli_si128(sum_reduced, 4));
    return _mm_cvtsi128_si32(sum_reduced);
}
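To make the "combine 2 or 4 movemask results and use popcnt" remark above concrete, here is a rough, untested little-endian sketch. The function name process_bit_vector_u32 is made up for illustration; it assumes the same value[i] <= threshold condition and x86's little-endian byte order inside each uint32_t, and it handles the dim % 32 remainder with a scalar tail loop:

#include <immintrin.h>
#include <stdint.h>

// Sketch: combine four movemask_ps results into one 32-bit word per iteration,
// count the set bits with popcnt, and bit-or the word into the little-endian bit vector.
int process_bit_vector_u32(uint32_t *bitVector32, float *value,
                           const float threshold_float, const long dim) {
    int sum = 0;
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);
    long i = 0;
    for (; i <= dim - 32; i += 32) {
        // each movemask_ps yields 8 bits; bit j is set iff value[i + offset + j] <= threshold
        uint32_t m0 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  0), _CMP_GE_OQ));
        uint32_t m1 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  8), _CMP_GE_OQ));
        uint32_t m2 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 16), _CMP_GE_OQ));
        uint32_t m3 = _mm256_movemask_ps(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 24), _CMP_GE_OQ));
        // merge into one 32-bit mask: bit k corresponds to value[i + k]
        uint32_t mask = m0 | (m1 << 8) | (m2 << 16) | (m3 << 24);
        sum += _mm_popcnt_u32(mask);
        *bitVector32++ |= mask;
    }
    // scalar tail for the remaining dim % 32 elements (same little-endian bit order)
    for (; i < dim; ++i) {
        int cond = (value[i] <= threshold_float);
        if (cond) *bitVector32 |= 1u << (i & 31);
        sum += cond;
    }
    return sum;
}

Whether this beats the byte-wise version depends on the target: the shifts and ORs that merge the four masks compete with the rest of the loop for scalar ALU ports.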
Godbolt-Link: https://godbolt.org/z/ABwDPe

A few notes on the generated assembly: one compiler produces vpsubd ymm2, ymm0, ymm1; vmovdqa ymm0, ymm2; instead of just vpsubd ymm0, ymm0, ymm1, while another fuses the load with the vcmpps (and uses an LE instead of a GE comparison). If you don't care about how NaNs are handled, you could use _CMP_NLT_US instead of _CMP_GE_OQ.

Revised version with big-endian output (untested):
int process_bit_vector(uint32_t *bitVector32, float *value,
                       const float threshold_float, const long dim) {
    int sum = 0;
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);

    for (long i = 0; i <= dim - 32; i += 32) {
        // compare next 4x8 values with threshold
        // (use threshold as first operand to allow loading other operand from memory)
        __m256i cmp_maskA = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  0), _CMP_GE_OQ));
        __m256i cmp_maskB = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i +  8), _CMP_GE_OQ));
        __m256i cmp_maskC = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 16), _CMP_GE_OQ));
        __m256i cmp_maskD = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i + 24), _CMP_GE_OQ));

        __m256i cmp_mask = _mm256_packs_epi16(
            _mm256_packs_epi16(cmp_maskA, cmp_maskB), // b7b7b6b6'b5b5b4b4'a7a7a6a6'a5a5a4a4 b3b3b2b2'b1b1b0b0'a3a3a2a2'a1a1a0a0
            _mm256_packs_epi16(cmp_maskC, cmp_maskD)  // d7d7d6d6'd5d5d4d4'c7c7c6c6'c5c5c4c4 d3d3d2d2'd1d1d0d0'c3c3c2c2'c1c1c0c0
        ); // cmp_mask = d7d6d5d4'c7c6c5c4'b7b6b5b4'a7a6a5a4 d3d2d1d0'c3c2c1c0'b3b2b1b0'a3a2a1a0
        cmp_mask = _mm256_permute4x64_epi64(cmp_mask, 0x8d);
        // cmp_mask = [b7b6b5b4'a7a6a5a4 b3b2b1b0'a3a2a1a0 d7d6d5d4'c7c6c5c4 d3d2d1d0'c3c2c1c0]
        __m256i shuff_idx = _mm256_broadcastsi128_si256(_mm_set_epi64x(0x00010203'08090a0b, 0x04050607'0c0d0e0f));
        cmp_mask = _mm256_shuffle_epi8(cmp_mask, shuff_idx);

        // extract bitmask
        uint32_t mask = _mm256_movemask_epi8(cmp_mask);
        sum += _mm_popcnt_u32(mask);
        // bitwise-or current mask with result bit-vector
        *bitVector32++ |= mask;
    }
    return sum;
}
The idea is to shuffle the bytes before applying a vpmovmskb on them. This takes 5 shuffle operations (including the 3 vpacksswb) for 32 input values, but computation of the sum is done using a popcnt instead of 4 vpsubd. The vpermq (_mm256_permute4x64_epi64) could probably be avoided by strategically loading 128-bit halves into the 256-bit vectors before comparing them. Another idea (since you need to shuffle the final result anyway) would be to blend together partial results (this tends to require p5 or 2*p015 on the architectures I checked, so it is probably not worth it).
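To illustrate the "strategically loading 128-bit halves" remark: if the little-endian output of the first version is acceptable, loading each compare input as two split 128-bit halves makes the in-lane vpacksswb emit the bytes already in order, so neither the vpermq nor the vpshufb is needed (the vpshufb above only exists to produce big-endian output). This is a rough, untested sketch; load_split_halves and process_bit_vector_split are made-up names, and the remainder loop is omitted as above:

#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: low 128 bits loaded from `lo`, high 128 bits from `hi`.
static inline __m256 load_split_halves(const float *lo, const float *hi) {
    return _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(lo)), _mm_loadu_ps(hi), 1);
}

// Little-endian variant of the 32-elements-per-iteration loop without vpermq/vpshufb.
int process_bit_vector_split(uint32_t *bitVector32, float *value,
                             const float threshold_float, const long dim) {
    int sum = 0;
    __m256 threshold_vector = _mm256_set1_ps(threshold_float);
    for (long i = 0; i <= dim - 32; i += 32) {
        // the low 128-bit halves of A..D together cover values i+0..i+15 and the high halves
        // cover i+16..i+31, so the in-lane packs below emit bytes for v0..v15 in lane 0
        // and v16..v31 in lane 1, already in order
        __m256i cmp_maskA = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i +  0, value + i + 16), _CMP_GE_OQ));
        __m256i cmp_maskB = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i +  4, value + i + 20), _CMP_GE_OQ));
        __m256i cmp_maskC = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i +  8, value + i + 24), _CMP_GE_OQ));
        __m256i cmp_maskD = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, load_split_halves(value + i + 12, value + i + 28), _CMP_GE_OQ));

        __m256i cmp_mask = _mm256_packs_epi16(
            _mm256_packs_epi16(cmp_maskA, cmp_maskB),
            _mm256_packs_epi16(cmp_maskC, cmp_maskD));

        uint32_t mask = _mm256_movemask_epi8(cmp_mask);  // bit k == (value[i + k] <= threshold)
        sum += _mm_popcnt_u32(mask);
        *bitVector32++ |= mask;
    }
    return sum;
}

The trade-off is four extra 128-bit load/insert operations per iteration in place of the two cross-lane shuffles, so whether this helps depends on whether the loop is shuffle-bound or load-bound on the target.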
Upvotes: 2