Reputation: 39879
I have a certain function that I need to make portable and efficient. Here is the naive implementation, just for reference:
template <std::unsigned_integral T>
constexpr T bitwise_exclusive_prefix_parity_naive(T x)
{
    constexpr int N = std::numeric_limits<T>::digits;
    T result = 0;
    bool parity = false;
    for (int i = 0; i < N; ++i) {
        result |= static_cast<T>(parity) << i;
        parity ^= (x >> i) & 1;
    }
    return result;
}
static_assert(bitwise_exclusive_prefix_parity_naive(0b0000'0000u) == 0b0000'0000);
static_assert(bitwise_exclusive_prefix_parity_naive(0b1111'1111u) == 0b1010'1010);
static_assert(bitwise_exclusive_prefix_parity_naive(0b1001'0000u) == 0b1110'0000);
static_assert(bitwise_exclusive_prefix_parity_naive(0b0100'1000u) == 0b0111'0000);
In short, for each bit position, this function computes the parity of the bits strictly to the right of it (the less significant bits).
This naive O(n) implementation can be beaten using a variety of techniques.
Most notably, it is equivalent to CLMUL(x, -2), where CLMUL is a carry-less multiplication.
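To make that equivalence concrete, here is a minimal illustrative sketch of a carry-less multiply truncated to T's width, checked against the naive function above (the name clmul_low is mine, not a library routine). Since T(-2) has every bit set except bit 0, bit i of the product is the XOR of all bits of x strictly below position i:
#include <concepts>
#include <limits>

// Illustrative carry-less multiply, keeping only the low N bits.
template <std::unsigned_integral T>
constexpr T clmul_low(T a, T b)
{
    T result = 0;
    for (int i = 0; i < std::numeric_limits<T>::digits; ++i) {
        if ((b >> i) & 1) {
            result ^= static_cast<T>(a << i); // XOR instead of ADD: no carries propagate
        }
    }
    return result;
}

// ~1u == T(-2) == ...11110, so each product bit collects the parity of all lower bits of x.
static_assert(clmul_low(0b1111'1111u, ~1u) == bitwise_exclusive_prefix_parity_naive(0b1111'1111u));
static_assert(clmul_low(0b1001'0000u, ~1u) == bitwise_exclusive_prefix_parity_naive(0b1001'0000u));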
Here is my progress so far:
#include <x86intrin.h> // only on x86
#include <arm_neon.h> // only on ARM
template <std::unsigned_integral T>
constexpr T bitwise_exclusive_prefix_parity(T x)
{
    constexpr int N = std::numeric_limits<T>::digits;
#ifdef __PCLMUL__ // x86
    if !consteval {
        if constexpr (N <= 64) {
            // CLMUL(x, -2): carry-less multiply by ...11110, keep the low N bits.
            const __m128i x_128 = _mm_set_epi64x(0, x);
            const __m128i neg2_128 = _mm_set_epi64x(0, -2);
            const __m128i result_128 = _mm_clmulepi64_si128(x_128, neg2_128, 0);
            return _mm_extract_epi64(result_128, 0) & T(-1);
        }
    }
#endif
    // Portable fallback: O(log N) shift/XOR steps (parallel prefix XOR).
    x <<= 1;
    for (int i = 1; i < N; i <<= 1) {
        x ^= x << i;
    }
    return x;
}
The x86 version appears to work correctly. I want to add a case for ARM Neon.
It should not use SVE2 because I don't need this function at all for SVE2.
The low-hanging fruit would be to use vmull_p64 and do essentially what the x86 version does, but I've heard that the 64-to-128-bit version is optional and you'd have to check at run time whether it is available.
How should I use vmull_p64 here, and is a run-time check necessary?
Upvotes: 2
Views: 139
Reputation: 12229
The NEON code using vmull_p64() looks something like this:
poly64_t a = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(x)), 0);
poly64_t b = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(uint64_t(-2))), 0);
uint64x2_t result = vreinterpretq_u64_p128(vmull_p64(a, b));
return vgetq_lane_u64(result, 0) & T(-1);
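This fragment assumes it sits inside your bitwise_exclusive_prefix_parity, with x, T, and N in scope. One way to guard it at compile time is the ACLE crypto macro; the exact guard below is an assumption (__ARM_FEATURE_AES on newer toolchains, __ARM_FEATURE_CRYPTO on older ones), mirroring the structure of your x86 branch:
#if defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO) // PMULL assumed present
    if !consteval {
        if constexpr (N <= 64) {
            const poly64_t a = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(x)), 0);
            const poly64_t b = vget_lane_p64(vreinterpret_p64_u64(vcreate_u64(uint64_t(-2))), 0);
            const uint64x2_t result = vreinterpretq_u64_p128(vmull_p64(a, b));
            return vgetq_lane_u64(result, 0) & T(-1);
        }
    }
#endif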
Running 200M iterations for a uint64_t type on an Apple M1 takes 0.9 s with the NEON version vs. 12.9 s with the naive version.
The vmull_p64 intrinsic is only available if the CPU implements the crypto extensions. It's widely, but not universally, available, so a feature check is required.
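When the crypto macro is not defined at compile time, you can still dispatch at run time. A minimal sketch for AArch64 Linux (one of several OS-specific mechanisms; macOS, for example, exposes this via sysctl instead, and the helper name has_pmull is just an example) would be:
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_PMULL

// Returns true when the CPU implements the PMULL (crypto) instructions.
inline bool has_pmull()
{
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
}
#endif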
Upvotes: 4