Jan Schultke

Reputation: 39879

How do you compute the bitwise exclusive prefix parity on ARM Neon?

I have a certain function that I need to make portable and efficient. Here is the naive implementation, just for reference:

#include <concepts> // std::unsigned_integral
#include <limits>   // std::numeric_limits

template <std::unsigned_integral T>
constexpr T bitwise_exclusive_prefix_parity_naive(T x)
{
    constexpr int N = std::numeric_limits<T>::digits;

    T result = 0;
    bool parity = false;
    for (int i = 0; i < N; ++i) {
        result |= static_cast<T>(parity) << i;
        parity ^= (x >> i) & 1;
    }

    return result;
}

static_assert(bitwise_exclusive_prefix_parity_naive(0b0000'0000u) == 0b0000'0000);
static_assert(bitwise_exclusive_prefix_parity_naive(0b1111'1111u) == 0b1010'1010);
static_assert(bitwise_exclusive_prefix_parity_naive(0b1001'0000u) == 0b1110'0000);
static_assert(bitwise_exclusive_prefix_parity_naive(0b0100'1000u) == 0b0111'0000);

In short, for each bit, this function computes the parity of the bits strictly below it (i.e. of lower significance).

This naive O(n) implementation can be beaten using a variety of techniques. Most notably, it is equivalent to CLMUL(x, -2) where CLMUL is a carry-less multiplication. Here is my progress so far:

#include <x86intrin.h> // only on x86
#include <arm_neon.h>  // only on ARM

template <std::unsigned_integral T>
constexpr T bitwise_exclusive_prefix_parity(T x)
{
    constexpr int N = std::numeric_limits<T>::digits;

#ifdef __PCLMUL__ // x86
    if !consteval {
        if constexpr (N <= 64) {
            const __m128i x_128 = _mm_set_epi64x(0, x);
            const __m128i neg2_128 = _mm_set_epi64x(0, -2);
            const __m128i result_128 = _mm_clmulepi64_si128(x_128, neg2_128, 0);
            return _mm_extract_epi64(result_128, 0) & T(-1);
        }
    }
#endif
    x <<= 1;
    for (int i = 1; i < N; i <<= 1) {
        x ^= x << i;
    }
    return x;
}

The x86 version appears to work correctly. I want to add a case for ARM Neon.

It should not use SVE2 because I don't need this function at all when SVE2 is available. The low-hanging fruit would be to use vmull_p64 and do essentially what the x86 version does, but I've heard that the 64-to-128-bit version is optional and you'd have to check at run time whether it is available.

Questions

Upvotes: 2

Views: 139

Answers (1)

solidpixel

Reputation: 12229

The NEON code using vmull_p64() looks something like this:

// reinterpret the 64-bit inputs as polynomials over GF(2)
const poly64_t a = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(x)), 0);
const poly64_t b = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(-2)), 0);
// 64x64 -> 128-bit carry-less multiply; keep only the low 64 bits
const uint64x2_t result = vreinterpretq_u64_p128(vmull_p64(a, b));
return vgetq_lane_u64(result, 0) & T(-1);

Running 200M iterations for a uint64_t type on an Apple M1 takes 0.9s for NEON, vs. 12.9s for the naive version.

The vmull_p64 intrinsic is only available if the CPU implements the crypto extensions. It's widely available but not universal, so a feature check is required.

Upvotes: 4
