_mm256_rem_epu64 intrinsic not found with GCC 10.3.0

Question

I try to re-write the following uint64_t 2x2 matrix multiplication with AVX-512 instructions, but GCC 10.3 does not found _mm256_rem_epu64 intrinsic.

#include 
#include 

constexpr uint32_t LAST_9_DIGITS_DIVIDER = 1000000000;

void multiply(uint64_t f[2][2], uint64_t m[2][2])
{
  uint64_t x = (f[0][0] * m[0][0] + f[0][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
  uint64_t y = (f[0][0] * m[0][1] + f[0][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
  uint64_t z = (f[1][0] * m[0][0] + f[1][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
  uint64_t w = (f[1][0] * m[0][1] + f[1][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;

  f[0][0] = x;
  f[0][1] = y;
  f[1][0] = z;
  f[1][1] = w;
}

void multiply_simd(uint64_t f[2][2], uint64_t m[2][2])
{
  __m256i v1 = _mm256_set_epi64x(f[0][0], f[0][0], f[1][0], f[1][0]);
  __m256i v2 = _mm256_set_epi64x(m[0][0], m[0][1], m[0][0], m[0][1]);
  __m256i v3 = _mm256_mullo_epi64(v1, v2);

  __m256i v4 = _mm256_set_epi64x(f[0][1], f[0][1], f[1][1], f[1][1]);
  __m256i v5 = _mm256_set_epi64x(m[1][0], m[1][1], m[1][0], m[1][1]);
  __m256i v6 = _mm256_mullo_epi64(v4, v5);

  __m256i v7 = _mm256_add_epi64(v3, v6);
  __m256i div = _mm256_set1_epi64x(LAST_9_DIGITS_DIVIDER);
  __m256i v8 = _mm256_rem_epu64(v7, div);
  _mm256_store_epi64(f, v8);
}

Is it possible somehow to enable _mm256_rem_epu64 or if not, some other way to calculate the reminder with SIMD instructions?

nemequ · Accepted Answer

As Peter Cordes mentioned in the comments, _mm256_rem_epu64 is an SVML function. Most compilers don't support SVML; AFAIK really only ICC does, but clang can be configured to use it too.

The only other implementation of SVML I'm aware of is in one of my projects, SIMDe. In this case, since you're using GCC 10.3, the implementation of _mm256_rem_epu64 will use vector extensions, so the code from SIMDe is going to be basically the same as something like:

#include 
#include 

typedef uint64_t u64x4 __attribute__((__vector_size__(32)));

__m256i
foo_mm256_rem_epu64(__m256i a, __m256i b) {
    return (__m256i) (((u64x4) a) % ((u64x4) b));
}

In this case, both GCC and clang will scalarize the operation (see Compiler Explorer), so performance is going to be pretty bad, especially considering how slow the div instruction is.

That said, since you're using a compile-time constant, the compiler should be able to replace the division with a multiplication and a shift, so performance will be better, but we can squeeze out some more by using libdivide.

Libdivide usually computes the the magic value at runtime, but the libdivide_u64_t structure is very simple and we can just skip the libdivide_u64_gen step and provide the struct at compile time:

__m256i div_by_1000000000(__m256i a) {
    static const struct libdivide_u64_t d = {
        UINT64_C(1360296554856532783),
        UINT8_C(93)
    };
    return libdivide_u64_do_vec256(a, &d);
}

Now, if you can use AVX-512VL + AVX-512DQ there is a 64-bit multiplication function (_mm256_mullo_epi64). If you can use that it's probably the right way to go:

__m256i rem_1000000000(__m256i a) {
    static const struct libdivide_u64_t d = {
        UINT64_C(1360296554856532783),
        UINT8_C(93)
    };
    return
        _mm256_sub_epi64(
            a,
            _mm256_mullo_epi64(
                libdivide_u64_do_vec256(a, &d),
                _mm256_set1_epi64x(1000000000)
            )
        );
}

(or on Compiler Explorer, with LLVM-MCA)

If you don't have AVX-512DQ+VL, you'll probably want to fall back on vector extensions again:

typedef uint64_t u64x4 __attribute__((__vector_size__(32)));

__m256i rem_1000000000(__m256i a) {
    static const struct libdivide_u64_t d = {
        UINT64_C(1360296554856532783),
        UINT8_C(93)
    };
    u64x4 one_billion = { 1000000000, 1000000000, 1000000000, 1000000000 };
    return (__m256i) (
        (
            (u64x4) a) -
            (((u64x4) libdivide_u64_do_vec256(a, &d)) * one_billion
        )
    );
}

(on Compiler Explorer)

All this is untested, but assuming I haven't made any stupid mistakes it should be relatively snappy.

If you really want to get rid of the libdivide dependency you could perform those operations yourself, but I don't really see any good reason not to use libdivide so I'll leave that as an exercise for someone else.

_mm256_rem_epu64 intrinsic not found with GCC 10.3.0

Answers (1)

Related Questions