Reputation: 111
I am trying to speed up a bitwise OR operation on very long binary vectors stored as 32-bit integers.
In this example we can assume that nwords, the number of words, is a multiple of both 4 and 8, so there is no loop remainder to handle. These binary vectors can contain many thousands of bits.
Moreover, all three bit vectors are allocated with _aligned_malloc() using 16- and 32-byte alignment, for SSE2 and AVX2 respectively.
To my surprise, the following scalar, SSE2, and AVX2 versions all execute in exactly the same amount of time on my i7 CPU. I didn't see the expected 4x and 8x speed-ups from the SSE2 and AVX2 registers.
My Visual Studio version is 15.1.
Scalar code:
void vectorOr_Scalar(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    unsigned int *end;
    for (end = ptr1 + nwords; ptr1 < end; ptr1++, ptr2++, out++) *out = *ptr1 | *ptr2;
}
SSE2 code:
void vectorOr_SSE2(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    int i;  /* intrinsics require <emmintrin.h> */
    for (i = 0; i < nwords; i += 4, ptr1 += 4, ptr2 += 4, out += 4)
    {
        __m128i v1 = _mm_load_si128((__m128i *)ptr1);
        __m128i v2 = _mm_load_si128((__m128i *)ptr2);
        _mm_store_si128((__m128i *)out, _mm_or_si128(v1, v2));
    }
}
AVX2 code:
void vectorOr_AVX2(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    int i;  /* intrinsics require <immintrin.h> */
    for (i = 0; i < nwords; i += 8, ptr1 += 8, ptr2 += 8, out += 8)
    {
        __m256i v1 = _mm256_load_si256((__m256i *)ptr1);
        __m256i v2 = _mm256_load_si256((__m256i *)ptr2);
        _mm256_store_si256((__m256i *)out, _mm256_or_si256(v1, v2));
    }
}
Perhaps this application is not well suited to vectorization because of the limited number of register operations between the loads and stores?
Upvotes: 0
Views: 987
Reputation: 136435
The reason you don't observe a performance difference between the loop that processes one unsigned at a time and a SIMD loop that processes 8 unsigned values at a time is that the compiler generates SIMD code for you, and also unrolls the loop; see the generated assembly.
Upvotes: 1