Reputation: 39879
I have a certain function that I need to make portable and efficient. Here is the naive implementation, just for reference:
template <std::unsigned_integral T>
constexpr T bitwise_exclusive_prefix_parity_naive(T x)
{
    constexpr int N = std::numeric_limits<T>::digits;
    T result = 0;
    bool parity = false;
    for (int i = 0; i < N; ++i) {
        result |= static_cast<T>(parity) << i;
        parity ^= (x >> i) & 1;
    }
    return result;
}
static_assert(bitwise_exclusive_prefix_parity_naive(0b0000'0000u) == 0b0000'0000);
static_assert(bitwise_exclusive_prefix_parity_naive(0b1111'1111u) == 0b1010'1010);
static_assert(bitwise_exclusive_prefix_parity_naive(0b1001'0000u) == 0b1110'0000);
static_assert(bitwise_exclusive_prefix_parity_naive(0b0100'1000u) == 0b0111'0000);
In short, for each bit position, this function computes the parity of the bits strictly to the right of it (the less significant bits).
This naive O(n) implementation can be beaten using a variety of techniques.
Most notably, it is equivalent to CLMUL(x, -2), where CLMUL is a carry-less multiplication.
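To make that equivalence concrete, here is a minimal illustrative sketch of a carry-less multiply truncated to T's width, checked against the naive function above (the name clmul_low is mine, not a library routine). Since T(-2) has every bit set except bit 0, bit i of the product is the XOR of all bits of x strictly below position i:
#include <concepts>
#include <limits>

// Illustrative carry-less multiply, keeping only the low N bits.
template <std::unsigned_integral T>
constexpr T clmul_low(T a, T b)
{
    T result = 0;
    for (int i = 0; i < std::numeric_limits<T>::digits; ++i) {
        if ((b >> i) & 1) {
            result ^= static_cast<T>(a << i); // XOR instead of ADD: no carries propagate
        }
    }
    return result;
}

// ~1u == T(-2) == ...11110, so each product bit collects the parity of all lower bits of x.
static_assert(clmul_low(0b1111'1111u, ~1u) == bitwise_exclusive_prefix_parity_naive(0b1111'1111u));
static_assert(clmul_low(0b1001'0000u, ~1u) == bitwise_exclusive_prefix_parity_naive(0b1001'0000u));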
Here is my progress so far:
#include <x86intrin.h> // only on x86
#include <arm_neon.h> // only on ARM
template <std::unsigned_integral T>
constexpr T bitwise_exclusive_prefix_parity(T x)
{
    constexpr int N = std::numeric_limits<T>::digits;
#ifdef __PCLMUL__ // x86
    if !consteval {
        if constexpr (N <= 64) {
            // CLMUL(x, -2): carry-less multiply by ...11110, keep the low N bits.
            const __m128i x_128 = _mm_set_epi64x(0, x);
            const __m128i neg2_128 = _mm_set_epi64x(0, -2);
            const __m128i result_128 = _mm_clmulepi64_si128(x_128, neg2_128, 0);
            return _mm_extract_epi64(result_128, 0) & T(-1);
        }
    }
#endif
    // Portable fallback: O(log N) shift/XOR steps (parallel prefix XOR).
    x <<= 1;
    for (int i = 1; i < N; i <<= 1) {
        x ^= x << i;
    }
    return x;
}
The x86 version appears to work correctly. I want to add a case for ARM Neon.
It should not use SVE2 because I don't need this function at all for SVE2.
The low-hanging fruit would be to use vmull_p64 and do essentially what the x86 version does, but I've heard that the 64-to-128-bit version is optional and you'd have to check at run time whether it is available.
How should I use vmull_p64 here, and is a run-time check necessary?
Upvotes: 2
Views: 139
Reputation: 12229
The NEON code using vmull_p64() looks something like this:
poly64_t a = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(x)), 0);
poly64_t b = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(uint64_t(-2))), 0);
uint64x2_t result = vreinterpretq_u64_p128(vmull_p64(a, b));
return vgetq_lane_u64(result, 0) & T(-1);
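This fragment assumes it sits inside your bitwise_exclusive_prefix_parity, with x, T, and N in scope. One way to guard it at compile time is the ACLE crypto macro; the exact guard below is an assumption (__ARM_FEATURE_AES on newer toolchains, __ARM_FEATURE_CRYPTO on older ones), mirroring the structure of your x86 branch:
#if defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO) // PMULL assumed present
    if !consteval {
        if constexpr (N <= 64) {
            const poly64_t a = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(x)), 0);
            const poly64_t b = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(uint64_t(-2))), 0);
            const uint64x2_t result = vreinterpretq_u64_p128(vmull_p64(a, b));
            return vgetq_lane_u64(result, 0) & T(-1);
        }
    }
#endif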
Running 200M iterations for a uint64_t type on an Apple M1 takes 0.9 s with the NEON version vs. 12.9 s with the naive version.
The vmull_p64 intrinsic is only available if the CPU implements the crypto extensions. It's widely, but not universally, available, so a feature check is required.
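When the crypto macro is not defined at compile time, you can still dispatch at run time. A minimal sketch for AArch64 Linux (one of several OS-specific mechanisms; macOS, for example, exposes this via sysctl instead, and the helper name has_pmull is just an example) would be:
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_PMULL

// Returns true when the CPU implements the PMULL (crypto) instructions.
inline bool has_pmull()
{
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
}
#endif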
Upvotes: 4