Reputation: 111
I am trying to speed up a bitwise OR operation on very long binary vectors stored as 32-bit integers.
In this example we can assume that nwords, the number of words, is a multiple of both 4 and 8, so there is no loop remainder to handle. These binary vectors can contain many thousands of bits.
Moreover, all three bit vectors are allocated with _aligned_malloc() using 16- and 32-byte alignment, for SSE2 and AVX2 respectively.
To my surprise, the following scalar, SSE2, and AVX2 versions all execute in exactly the same amount of time on my i7 CPU. I didn't see the expected 4x and 8x speed-ups from the SSE2 and AVX2 registers.
My Visual Studio version is 15.1.
Scalar code:
void vectorOr_Scalar(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    unsigned int *end;
    for (end = ptr1 + nwords; ptr1 < end; ptr1++, ptr2++, out++) *out = *ptr1 | *ptr2;
}
SSE2 code:
void vectorOr_SSE2(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    int i;  /* intrinsics require <emmintrin.h> */
    for (i = 0; i < nwords; i += 4, ptr1 += 4, ptr2 += 4, out += 4)
    {
        __m128i v1 = _mm_load_si128((__m128i *)ptr1);
        __m128i v2 = _mm_load_si128((__m128i *)ptr2);
        _mm_store_si128((__m128i *)out, _mm_or_si128(v1, v2));
    }
}
AVX2 code:
void vectorOr_AVX2(unsigned int *ptr1, unsigned int *ptr2, unsigned int *out, int nwords)
{
    int i;  /* intrinsics require <immintrin.h> */
    for (i = 0; i < nwords; i += 8, ptr1 += 8, ptr2 += 8, out += 8)
    {
        __m256i v1 = _mm256_load_si256((__m256i *)ptr1);
        __m256i v2 = _mm256_load_si256((__m256i *)ptr2);
        _mm256_store_si256((__m256i *)out, _mm256_or_si256(v1, v2));
    }
}
Perhaps this application is not well suited to vectorization because of the limited number of register operations between the loads and stores?
Upvotes: 0
Views: 987
Reputation: 136435
The reason you don't observe a performance difference between the loop that processes one unsigned at a time and a SIMD loop that processes 8 unsigned values at a time is that the compiler generates SIMD code for you, and also unrolls the loop; see the generated assembly.
Upvotes: 1