g7573025

Reputation: 33

Calculating min of 8 long ints using AVX2

I was trying to find the min of 8 long ints using AVX2. I am new to SIMD programming and have no idea where to start. I did not see any post/example which explains how to carry out min and max in AVX2. I know that I cannot fit more than 4 long ints in one vector due to the 256 bit limit, but I can solve my problem using three steps. Also I cannot figure out how to load the data of an already existing normal long int array into vectors for AVX2.

I know the idea behind the process; this is what I am trying to achieve:

long int nums[8] = {1, 2, 3, 4, 5, 6, 7, 8};
a = min(1,2) ; b = min(3,4) ; c = min(5,6) ; d = min(7,8)
x = min(a,b) ; y = min(c,d)
answer  = min(x,y)

Can someone help me out as to how to get this to work? Also, since the last min is a single operation, is it better to do it on the CPU? Should I use something other than AVX2? (I am on an x86 system.)

Upvotes: 3

Views: 3082

Answers (1)

Peter Cordes

Reputation: 364180

For x86 optimization and so on, see the links on https://stackoverflow.com/tags/x86/info. Esp. Intel's intrinsics guide, and Agner Fog's stuff.

If you always have exactly 8 elements (64 bytes), that simplifies things a lot. One of the major challenges when vectorizing small stuff is to not add too much startup/cleanup overhead handling the leftover elements that don't fill a whole vector.

AVX2 doesn't have min/max instructions for packed 64-bit ints, only for 8, 16, and 32-bit elements. That means you need to emulate it with a compare that generates a mask (all-0s for elements where the condition is false, all-1s where it's true), so you can AND with this mask to zero out elements in other vectors. To save on actually doing the AND/ANDN and OR operations to combine things with the mask, there are blend instructions.
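For example, a 64-bit signed min helper along these lines shows the compare-then-blend idea, with the equivalent AND/ANDN/OR select left in a comment (a minimal untested sketch, assuming <immintrin.h> is included; the function name is just for illustration):

// Hypothetical helper: emulate a packed 64-bit signed min with AVX2.
// vpcmpgtq sets a lane to all-1s where a > b; vpblendvb then takes b there.
static inline __m256i min_epi64_avx2(__m256i a, __m256i b)
{
    __m256i a_gt_b = _mm256_cmpgt_epi64(a, b);   // mask: all-1s where a > b
    return _mm256_blendv_epi8(a, b, a_gt_b);     // take b where a > b, else keep a
    // Without a blend, the same select is (a & ~mask) | (b & mask):
    // _mm256_or_si256(_mm256_andnot_si256(a_gt_b, a), _mm256_and_si256(a_gt_b, b));
}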

AVX-512 will bring a big speedup for this operation (support first arriving in Xeon-only Skylake parts). It has a packed 64-bit min, vpminsq, exposed as _mm512_min_epi64 (and as _mm_min_epi64 / _mm256_min_epi64 with AVX-512VL). There's also a library function for this operation: __int64 _mm512_reduce_min_epi64 (__m512i a). I assume this intrinsic will emit a short sequence of shuffles and vpminsq instructions. Intel lists it in their intrinsics finder, but it's just an Intel library function, not a machine instruction.
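As a rough sketch of how that would look (untested, and assuming a compiler that actually provides the _mm512_reduce_min_epi64 pseudo-intrinsic):

#include <stdint.h>
#include <immintrin.h>

// Requires AVX-512F and a compiler providing _mm512_reduce_min_epi64
// (it expands to a few shuffle + vpminsq instructions, not one instruction).
int64_t min8_avx512(const int64_t input[8])
{
    __m512i v = _mm512_loadu_si512(input);   // all 8 elements in one 512-bit load
    return _mm512_reduce_min_epi64(v);       // horizontal signed-min reduction
}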

Here's an AVX2 implementation that should work. I haven't tested it, but the compiled output looks like the right instruction sequence. I may have gotten a comparison reversed in there somewhere, so check it.

The principle of operation is: get the elementwise min of two 256b vectors. Split that into two 128b vectors and get the elementwise min of that. Then take that vector of two 64b values back to GP registers and do the final min. Max is done at the same time, interleaved with the min.

(Oops, you mentioned min/max in your question, but now I see you actually only wanted min. Removing the unneeded parts is trivial, and you can change it to a return value instead of storing results through pointers/references. A scalar version might be faster; better to test in the context where your app uses this operation, not a standalone microbenchmark.)

#include <stdint.h>
#include <immintrin.h>

int64_t input[8] = { 1, 2, 3, };

#define min(a,b) \
   ({ __typeof__ (a) _a = (a); __typeof__ (b) _b = (b); \
     _a < _b ? _a : _b; })

#define max(a,b) \
   ({ __typeof__ (a) _a = (a); \
       __typeof__ (b) _b = (b); \
     _a > _b ? _a : _b; })

// put this where it can get inlined.  You don't want to actually store the results to RAM
// or have the compiler-generated VZEROUPPER at the end for every use.
void minmax64(int64_t input[8], int64_t *minret, int64_t *maxret)
{
    __m256i *in_vec = (__m256i*)input;
    __m256i v0 = in_vec[0], v1=in_vec[1];  // _mm256_loadu_si256 is optional for AVX

    __m256i gt = _mm256_cmpgt_epi64(v0, v1); // 0xff.. for elements where v0 > v1.  0 elsewhere
    __m256i minv = _mm256_blendv_epi8(v0, v1, gt);  // take bytes from v1 where gt=0xff (i.e. where v0>v1)
    __m256i maxv = _mm256_blendv_epi8(v1, v0, gt);  // input order reversed

    /* for 8, 16, or 32b:  cmp/blend isn't needed
       minv = _mm256_min_epi32(v0,v1);
       maxv = _mm256_max_epi32(v0,v1);  // one insn each instead of cmp+blend: shorter and much faster (esp. latency)
       And at the stage of having a 128b vectors holding the min and max candidates,
       you'd shuffle and repeat to get the low 64, and optionally again for the low 32,
       before extracting to GP regs to finish the comparisons.
     */

    __m128i min0 = _mm256_castsi256_si128(minv); // stupid gcc 4.9.2 compiles this to a vmovdqa
    __m128i min1 = _mm256_extracti128_si256(minv, 1);  // extracti128(x, 0) should optimize away to nothing.

    __m128i max0 = _mm256_castsi256_si128(maxv);
    __m128i max1 = _mm256_extracti128_si256(maxv, 1);

    __m128i gtmin = _mm_cmpgt_epi64(min0, min1);
    __m128i gtmax = _mm_cmpgt_epi64(max0, max1);
    min0 = _mm_blendv_epi8(min0, min1, gtmin);
    max0 = _mm_blendv_epi8(max1, max0, gtmax);

    int64_t tmp0 = _mm_cvtsi128_si64(min0);    // tmp0 = min0.m128i_i64[0];  // MSVC only
    int64_t tmp1 = _mm_extract_epi64(min0, 1);
    *minret = min(tmp0, tmp1);  // compiles to a quick cmp / cmovg of 64bit GP registers

    tmp0 = _mm_cvtsi128_si64(max0);
    tmp1 = _mm_extract_epi64(max0, 1);
    *maxret = max(tmp0, tmp1);
}
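
For the min-only case you asked about, the same strategy shrinks to something like this (an untested sketch derived from the function above, returning by value and reusing the same includes):

// Min-only variant: 256b min candidates -> 128b -> scalar, returned by value.
static inline int64_t min8_avx2(const int64_t input[8])
{
    __m256i v0 = _mm256_loadu_si256((const __m256i*)&input[0]);
    __m256i v1 = _mm256_loadu_si256((const __m256i*)&input[4]);

    __m256i gt   = _mm256_cmpgt_epi64(v0, v1);       // all-1s where v0 > v1
    __m256i minv = _mm256_blendv_epi8(v0, v1, gt);   // 4 min candidates

    __m128i lo  = _mm256_castsi256_si128(minv);
    __m128i hi  = _mm256_extracti128_si256(minv, 1);
    __m128i gt2 = _mm_cmpgt_epi64(lo, hi);
    __m128i m2  = _mm_blendv_epi8(lo, hi, gt2);      // 2 min candidates

    int64_t a = _mm_cvtsi128_si64(m2);
    int64_t b = _mm_extract_epi64(m2, 1);
    return a < b ? a : b;                            // final scalar min
}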

This may or may not be faster than doing the whole thing in GP registers, since a 64-bit load is one uop, cmp is one uop, and cmovcc is only 2 uops (on Intel). Haswell can issue 4 uops per cycle. Until you get to the bottom of the compare tree, there's lots of independent work to do, and even then cmp is 1 cycle latency and cmov is 2. If you're interleaving the work for a min and a max at the same time, there are two separate dependency chains (or trees, in this case).
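For comparison, a pure-scalar version of the reduction tree might look like this (a sketch; a decent compiler turns each ternary into a cmp/cmov pair, and the four leaf comparisons are independent):

#include <stdint.h>

// Scalar reduction tree: compare independent pairs first so the
// cmp/cmov dependency chains can overlap.
int64_t min8_scalar(const int64_t v[8])
{
    int64_t a = v[0] < v[1] ? v[0] : v[1];
    int64_t b = v[2] < v[3] ? v[2] : v[3];
    int64_t c = v[4] < v[5] ? v[4] : v[5];
    int64_t d = v[6] < v[7] ? v[6] : v[7];
    int64_t x = a < b ? a : b;
    int64_t y = c < d ? c : d;
    return x < y ? x : y;
}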

The vector version has much higher latency than throughput. If you need this operation on multiple independent sets of 8 values, the vector version is probably going to do well. Otherwise, the 5 cycle latency of pcmpgt*, and 2 cycle latency of blendv is going to hurt. If there is other independent work that can be happening in parallel, then that's fine.

If you had smaller integers, pmin* (signed or unsigned, 8, 16, or 32b) is 1 cycle latency, 2 per cycle throughput. For 16b unsigned elements only, there's even a horizontal min instruction that gives you the min element out of the 8 in one vector, as user-number-guy commented. This cuts out the whole split / min narrowing process that's needed after getting the min candidates down to fitting in one vector.
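For example, with unsigned 16b elements the whole horizontal reduction is a single SSE4.1 instruction (untested sketch):

#include <stdint.h>
#include <immintrin.h>

// Horizontal min of 8 uint16_t with phminposuw (SSE4.1).
// The low 16 bits of the result hold the min value; bits 18:16 hold its index.
uint16_t min8_u16(const uint16_t v[8])
{
    __m128i x = _mm_loadu_si128((const __m128i*)v);
    __m128i m = _mm_minpos_epu16(x);
    return (uint16_t)_mm_cvtsi128_si32(m);   // keep just the min value
}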

Upvotes: 5
