Daniel

Reputation: 8441

Emulating shifts on 64 bytes with AVX-512

My question is an extension of a previous question: Emulating shifts on 32 bytes with AVX.

How do I implement similar shifts on 64 bytes with AVX-512? Specifically, how should I implement _mm512_slli_si512 and _mm512_srli_si512, corresponding to the SSE2 intrinsics _mm_slli_si128 and _mm_srli_si128?

Upvotes: 3

Views: 1362

Answers (2)

Alexis Wilke

Reputation: 20730

If you need to shift by exactly 64 bits, you can use a permute instruction, which works entirely in registers. For a shift by a multiple of 8 bits you could use a byte shuffle (see VPSHUFB; if you are dealing with floats, look at the cast functions, since the shuffles work on integer vectors).

Here is an example that shifts right by 64 bits ("SHR zmm1, 64"). The mask is used to clear the top 64 bits; if you want ROR-like functionality, use the version without the mask. A shift to the left is also possible: just change the indexes and the mask as required (see the left-shift sketch at the end of this answer).

#include <immintrin.h>
#include <cstdint>
#include <iostream>

void show(char const * msg, double const * v)
{
    std::cout << msg << ":";
    for(int j(0); j < 8; ++j)
    {
        std::cout << " " << v[j];
    }
    std::cout << "\n";
}


int main()
{
    double v[8] = { 1., 2., 3., 4., 5., 6., 7., 8. };
    double q[8] = {};
    alignas(64) std::uint64_t indexes[8] = { 1, 2, 3, 4, 5, 6, 7, 0 };

    show("init", v);
    show("q", q);

    // load
    __m512d a(_mm512_loadu_pd(v));
    __m512i i(_mm512_load_epi64(indexes));

    // shift
    //__m512d b(_mm512_permutex_pd(a, 0x39));   // can't cross between 4 low and 4 high with an immediate
    //__m512d b(_mm512_permutexvar_pd(i, a));   // ROR
    __m512d b(_mm512_maskz_permutexvar_pd(0x7F, i, a));   // LSR on a double basis

    // store
    _mm512_storeu_pd(q, b);

    show("shifted", q);
    show("original", v);
}

Fully optimized output (-O3) reduces the whole shift to 3 instructions (which are intermingled with others in the output):

 96a:   62 f1 fd 48 6f 85 10    vmovdqa64 -0xf0(%rbp),%zmm0
 971:   ff ff ff 
 974:   b8 7f 00 00 00          mov    $0x7f,%eax              # mask
 979:   48 8d 3d 10 04 00 00    lea    0x410(%rip),%rdi        # d90 <_IO_stdin_used+0x10>
 980:   c5 f9 92 c8             kmovb  %eax,%k1                # special k1 register
 984:   4c 89 e6                mov    %r12,%rsi
 987:   62 f2 fd c9 16 85 d0    vpermpd -0x130(%rbp),%zmm0,%zmm0{%k1}{z}   # "shift"
 98e:   fe ff ff 
 991:   62 f1 fd 48 11 45 fe    vmovupd %zmm0,-0x80(%rbp)

In my case I want to use this in a loop, and the load (vmovdqa64) and store (vmovupd) will be hoisted before and after the loop, so inside the loop it will be really fast. (It needs to rotate that way 4,400 times before I need to save the result.)

As pointed out by Peter, we can also use the valignq instruction:

// this replaces the permute and does not need the index vector;
// with mask 0xFF this is a rotate (ROR by 64 bits) -- pass 0x7F instead
// to zero the top element and get the shift, as with the masked permute above
__m512i b(_mm512_maskz_alignr_epi64(0xFF, _mm512_castpd_si512(a), _mm512_castpd_si512(a), 1));

and the result is one instruction like so:

 979:   62 f1 fd 48 6f 85 d0    vmovdqa64 -0x130(%rbp),%zmm0
 980:   fe ff ff 
 983:   48 8d 75 80             lea    -0x80(%rbp),%rsi
 987:   48 8d 3d 02 04 00 00    lea    0x402(%rip),%rdi        # d90 <_IO_stdin_used+0x10>
 98e:   62 f3 fd 48 03 c0 01    valignq $0x1,%zmm0,%zmm0,%zmm0
 995:   62 f1 fd 48 11 45 fd    vmovupd %zmm0,-0xc0(%rbp)

An important point: using fewer registers is also much better, since it increases our chances of keeping the whole computation in registers instead of having to go through memory (512 bits is a lot to transfer to and from memory).
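
For reference, here is a minimal sketch of the left-shift variant mentioned above; only the index vector and the mask change (the function name is just illustrative):

#include <immintrin.h>
#include <cstdint>

// shift a __m512d left by one 64-bit element ("SHL zmm, 64"),
// i.e. the result is { 0, a[0], a[1], ..., a[6] }
__m512d shift_left_one_double(__m512d a)
{
    // index 7 in slot 0 is irrelevant since the mask bit for slot 0 is clear
    alignas(64) static std::uint64_t const left_indexes[8] = { 7, 0, 1, 2, 3, 4, 5, 6 };
    __m512i il(_mm512_load_epi64(left_indexes));
    return _mm512_maskz_permutexvar_pd(0xFE, il, a);   // mask 0xFE zeroes element 0
}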

Upvotes: 1

chtz

Reputation: 18807

Here is a working solution using a temporary array:

#include <immintrin.h>
#include <cstddef>

__m512i _mm512_srli_si512(__m512i a, size_t imm8)
{
    // set up temporary array and set upper half to zero 
    // (this needs to happen outside any critical loop)
    alignas(64) char temp[128];
    _mm512_store_si512(temp+64, _mm512_setzero_si512());

    // store input into lower half
    _mm512_store_si512(temp, a);

    // load shifted register
    return _mm512_loadu_si512(temp+imm8);
}

__m512i _mm512_slli_si512(__m512i a, size_t imm8)
{
    // set up temporary array and set lower half to zero 
    // (this needs to happen outside any critical loop)
    alignas(64) char temp[128];
    _mm512_store_si512(temp, _mm512_setzero_si512());

    // store input into upper half
    _mm512_store_si512(temp+64, a);

    // load shifted register
    return _mm512_loadu_si512(temp+(64-imm8));
}

This should also work if imm8 is not known at compile time, but it does not do any out-of-bounds checks. You could actually use a single 3*64-byte temporary and share it between the left- and right-shift methods (and both would then also work for negative shift counts), as sketched below.

Of course, if you share a temporary outside the function body, you must make sure that it is not accessed by multiple threads at once.
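
A minimal sketch of that shared-temporary idea (the name shift_bytes and the static buffer are my own choices; per the caveat above, such a buffer must not be shared between threads):

#include <immintrin.h>
#include <cstddef>

// one shared 3*64-byte buffer: [ 64 zero bytes | input | 64 zero bytes ]
alignas(64) static char shift_temp[3 * 64];

// positive count shifts right, negative count shifts left, |count| <= 64
__m512i shift_bytes(__m512i a, std::ptrdiff_t count)
{
    // zero both neighbours of the middle slot
    // (this could be hoisted outside any critical loop)
    _mm512_store_si512(shift_temp, _mm512_setzero_si512());
    _mm512_store_si512(shift_temp + 128, _mm512_setzero_si512());

    // store the input into the middle 64 bytes
    _mm512_store_si512(shift_temp + 64, a);

    // a right shift by count bytes starts reading count bytes further on;
    // a negative count starts reading before the data, i.e. shifts left
    return _mm512_loadu_si512(shift_temp + 64 + count);
}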

Godbolt-Link with usage demonstration: https://godbolt.org/z/LSgeWZ


As Peter noted, this store-load trick will cause a store-forwarding stall on all CPUs with AVX-512. The most efficient forwarding case (~6 cycle latency) only works when all the loaded bytes come from a single store. If the load goes outside the most recent store that overlaps it at all, it has extra latency (around ~16 cycles) to scan the store buffer and, if needed, merge in bytes from L1d cache. See Can modern x86 implementations store-forward from more than one prior store? and Agner Fog's microarch guide for more details. This extra scanning can probably happen for multiple loads in parallel, and at least doesn't stall other things (like normal store-forwarding or the rest of the pipeline), so it may not be a throughput problem.

If you want many shift offsets of the same data, one store and multiple reloads at different alignments should be good.

But if latency is your primary concern you should try a solution based on valignd (also, if you only need to shift by a multiple of 4 bytes, that is obviously the easier problem anyway). Or, for constant shift counts, a vector control for vpermw could work; a sketch follows.
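
A hedged sketch of that vpermw idea, for a compile-time shift right by a multiple of 2 bytes (requires AVX-512BW; the function name and the runtime construction of the control vector are my own -- in practice the control would be a precomputed constant):

#include <immintrin.h>
#include <cstdint>

// right-shift the whole 64-byte register by N bytes, N a constant multiple of 2
template<int N>
__m512i srli_bytes_vpermw(__m512i a)
{
    static_assert(N % 2 == 0 && N >= 0 && N <= 64, "N must be a multiple of 2 bytes");
    constexpr int W = N / 2;                        // shift amount in 16-bit words

    // word j of the result should come from word j + W of the input
    alignas(64) std::uint16_t idx[32];
    for (int j = 0; j < 32; ++j)
        idx[j] = static_cast<std::uint16_t>(j + W);
    __m512i control = _mm512_load_si512(idx);

    // words whose source would fall past the end are zeroed via the mask
    __mmask32 keep = (W >= 32) ? 0 : (0xFFFFFFFFu >> W);
    return _mm512_maskz_permutexvar_epi16(keep, control, a);
}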


For completeness, here is a version based on valignd and valignr that works for shifts from 0 to 64 bytes known at compile time (using C++17 -- but you can easily avoid the if constexpr; it is only here because of the static_assert). Instead of shifting in zeros you can pass a second register (i.e., it behaves the way valignr would behave if it aligned across lanes).

template<int N>
__m512i shift_right(__m512i a, __m512i carry = _mm512_setzero_si512())
{
  static_assert(0 <= N && N <= 64);
  if constexpr(N   == 0) return a;
  if constexpr(N   ==64) return carry;
  if constexpr(N%4 == 0) return _mm512_alignr_epi32(carry, a, N / 4);
  else
  {
    __m512i a0 = shift_right< (N/16 + 1)*16>(a, carry);  // 16, 32, 48, 64
    __m512i a1 = shift_right< (N/16    )*16>(a, carry);  //  0, 16, 32, 48
    return _mm512_alignr_epi8(a0, a1, N % 16);
  }
}

template<int N>
__m512i shift_left(__m512i a, __m512i carry = _mm512_setzero_si512())
{
  return shift_right<64-N>(carry, a);
}

Here is a godbolt-link with some example assembly as well as output for every possible shift_right operation: https://godbolt.org/z/xmKJvA

GCC faithfully translates this into valignd and valignr instructions -- but may emit an unnecessary vpxor instruction (e.g. in the shiftleft_49 example). Clang does some crazy substitutions (not sure whether they actually make a difference, though).

The code could be extended to shift an arbitrary sequence of registers (always carrying in bytes from the previous register). A short usage sketch follows.
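
A small usage sketch of the templates above (assuming they are in scope; the byte values and the printed elements are just for illustration):

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    alignas(64) std::uint8_t bytes[64];
    for (int i = 0; i < 64; ++i)
        bytes[i] = static_cast<std::uint8_t>(i);

    __m512i a  = _mm512_load_si512(bytes);
    __m512i r1 = shift_right<7>(a);      // bytes 7..63 of a, then 7 zero bytes
    __m512i r2 = shift_right<7>(a, a);   // carry = a, so this is a rotate by 7 bytes

    alignas(64) std::uint8_t out[64];
    _mm512_store_si512(out, r1);
    std::printf("%u %u ... %u\n", out[0], out[1], out[63]);   // prints: 7 8 ... 0

    _mm512_store_si512(out, r2);
    std::printf("%u %u ... %u\n", out[0], out[1], out[63]);   // prints: 7 8 ... 6
}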

Upvotes: 2
