Reputation: 33
I am trying to find the min of 8 long ints using AVX2. I am new to SIMD programming and have no idea where to start. I did not see any post/example which explains how to carry out min and max in AVX2. I know that a single 256-bit vector cannot hold more than 4 long ints, but I can solve my problem in three steps. I also cannot figure out how to load the data of an already existing, normal long int array into vectors for AVX2.
I know the idea behind the process. This is what I am trying to achieve:
long int nums[] = {1, 2, 3, 4, 5, 6, 7, 8};
a = min(1,2) ; b = min(3,4) ; c = min(5,6) ; d = min(7,8)
x = min(a,b) ; y = min(c,d)
answer = min(x,y)
Can someone help me out with how to get this to work? Also, the last min is a single operation; is it better to do it in scalar code on the CPU? Should I use something other than AVX2? (I am on an x86 system.)
Upvotes: 3
Views: 3082
Reputation: 364180
For x86 optimization and so on, see the links on https://stackoverflow.com/tags/x86/info. Esp. Intel's intrinsics guide, and Agner Fog's stuff.
If you always have exactly 8 elements (64 bytes), that simplifies things a lot. One of the major challenges when vectorizing small stuff is to not add too much startup/cleanup overhead handling the leftover elements that don't fill a whole vector.
AVX2 doesn't have min/max instructions for packed 64-bit ints; only for 8, 16, and 32-bit elements. That means you need to emulate it with a compare that generates a mask (all-0s in elements where the condition is false, all-1s where it's true), which you can AND with other vectors to zero out elements. To save on actually doing the AND/ANDN and OR operations to combine things with the mask, there are blend instructions.
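As a minimal sketch of that emulation for one pair of vectors (untested; it's the same cmpgt/blendv pattern used in the full function below):

#include <immintrin.h>
// Sketch: packed 64-bit signed min of two AVX2 vectors, emulated with a
// compare (to build a per-element mask) plus a variable blend.
static inline __m256i min_epi64_avx2(__m256i a, __m256i b)
{
    __m256i a_gt_b = _mm256_cmpgt_epi64(a, b);   // all-1s in elements where a > b
    return _mm256_blendv_epi8(a, b, a_gt_b);     // take b where a > b, otherwise a
}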
AVX-512 will bring a big speedup for this operation (support coming in (Xeon-only) Skylake). It has a vpminsq instruction for packed 64-bit min (intrinsic _mm512_min_epi64, with 128/256-bit forms under AVX-512VL). There's also a library function for the whole reduction: __int64 _mm512_reduce_min_epi64 (__m512i a). I assume it will emit a sequence of shuffle and vpminsq instructions. Intel lists it in their intrinsics finder, but it's just an Intel library function, not a machine instruction.
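With AVX-512, the whole thing would collapse to something like this (just a sketch, assuming AVX-512F and a compiler that provides the reduce helper; min8_avx512 is a name made up for illustration):

#include <stdint.h>
#include <immintrin.h>
// Sketch: AVX-512 version. One unaligned 512-bit load covers all 8 elements,
// and the reduce "intrinsic" (really a library helper) finishes the horizontal min.
int64_t min8_avx512(const int64_t input[8])
{
    __m512i v = _mm512_loadu_si512(input);
    return _mm512_reduce_min_epi64(v);
}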
Here's an AVX2 implementation that should work. I haven't tested it, but the compiled output looks like the right instruction sequence. I may have gotten a comparison reversed in there somewhere, so check it.
The principle of operation is: get the elementwise min of two 256b vectors. Split that into two 128b vectors and get the elementwise min of that. Then take that vector of two 64b values back to GP registers and do the final min. Max is done at the same time, interleaved with the min.
(Oops, you mentioned min/max in your question, but now I see you actually just wanted min. Removing the un-needed parts is trivial, and you can change it to a return value instead of storing results through pointers/references. A scalar version might be faster; better to test in the context where your app uses this operation (not a standalone microbenchmark).)
#include <stdint.h>
#include <immintrin.h>
int64_t input[8] = { 1, 2, 3, };
#define min(a,b) \
({ __typeof__ (a) _a = (a); __typeof__ (b) _b = (b); \
_a < _b ? _a : _b; })
#define max(a,b) \
({ __typeof__ (a) _a = (a); \
__typeof__ (b) _b = (b); \
_a > _b ? _a : _b; })
// put this where it can get inlined. You don't want to actually store the results to RAM
// or have the compiler-generated VZEROUPPER at the end for every use.
void minmax64(int64_t input[8], int64_t *minret, int64_t *maxret)
{
__m256i *in_vec = (__m256i*)input;
__m256i v0 = in_vec[0], v1=in_vec[1]; // _mm256_loadu_si256 is optional for AVX
__m256i gt = _mm256_cmpgt_epi64(v0, v1); // 0xff.. for elements where v0 > v1. 0 elsewhere
__m256i minv = _mm256_blendv_epi8(v0, v1, gt); // take bytes from v1 where gt=0xff (i.e. where v0>v1)
__m256i maxv = _mm256_blendv_epi8(v1, v0, gt); // input order reversed
/* for 8, 16, or 32b: cmp/blend isn't needed
minv = _mm256_min_epi32(v0,v1);
maxv = _mm256_max_epi32(v0,v1); // one insn shorter, and much faster (esp. latency)
And at the stage of having a 128b vectors holding the min and max candidates,
you'd shuffle and repeat to get the low 64, and optionally again for the low 32,
before extracting to GP regs to finish the comparisons.
*/
__m128i min0 = _mm256_castsi256_si128(minv); // stupid gcc 4.9.2 compiles this to a vmovdqa
__m128i min1 = _mm256_extracti128_si256(minv, 1); // extracti128(x, 0) should optimize away to nothing.
__m128i max0 = _mm256_castsi256_si128(maxv);
__m128i max1 = _mm256_extracti128_si256(maxv, 1);
__m128i gtmin = _mm_cmpgt_epi64(min0, min1);
__m128i gtmax = _mm_cmpgt_epi64(max0, max1);
min0 = _mm_blendv_epi8(min0, min1, gtmin);
max0 = _mm_blendv_epi8(max1, max0, gtmax);
int64_t tmp0 = _mm_cvtsi128_si64(min0); // tmp0 = min0.m128i_i64[0]; // MSVC only
int64_t tmp1 = _mm_extract_epi64(min0, 1);
*minret = min(tmp0, tmp1); // compiles to a quick cmp / cmovg of 64bit GP registers
tmp0 = _mm_cvtsi128_si64(max0);
tmp1 = _mm_extract_epi64(max0, 1);
*maxret = max(tmp0, tmp1);
}
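For completeness, a trivial caller might look like this (just a usage sketch; assumes it's compiled in the same file as minmax64 above):

#include <stdio.h>
int main(void)
{
    int64_t nums[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int64_t lo, hi;
    minmax64(nums, &lo, &hi);    // lo = 1, hi = 8 for this input
    printf("min=%lld max=%lld\n", (long long)lo, (long long)hi);
    return 0;
}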
This may or may not be faster than doing the whole thing in GP registers, since a 64-bit load is one uop, cmp is one uop, and cmovcc is only 2 uops (on Intel). Haswell can issue 4 uops per cycle. Until you get to the bottom of the compare tree, there's lots of independent work to do, and even so, cmp is 1 cycle latency and cmov is 2. If you're interleaving the work for a min and a max at the same time, there are two separate dependency chains (or trees, in this case).
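A scalar version of that compare tree would be just a few lines (untested sketch, reusing the min() macro from above; the compiler should turn each min into a cmp/cmov pair):

// Sketch: all-GP-register min of 8 int64_t, written as a balanced tree so the
// first four mins (and then the next two) are independent of each other.
int64_t min8_scalar(const int64_t v[8])
{
    int64_t a = min(v[0], v[1]);
    int64_t b = min(v[2], v[3]);
    int64_t c = min(v[4], v[5]);
    int64_t d = min(v[6], v[7]);
    int64_t x = min(a, b);
    int64_t y = min(c, d);
    return min(x, y);          // final min
}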
The vector version has much higher latency than throughput. If you need this operation on multiple independent sets of 8 values, the vector version is probably going to do well. Otherwise, the 5-cycle latency of pcmpgt* and the 2-cycle latency of blendv are going to hurt. If there's other independent work that can happen in parallel, that's fine.
If you had smaller integers, pmin* (signed or unsigned; 8, 16, or 32b elements) is 1 cycle latency, 2 per cycle throughput. For 16b unsigned elements only, there's even a horizontal min instruction (phminposuw) that gives you the min element out of the 8 in one vector, as user-number-guy commented. That cuts out the whole split / min narrowing process that's needed after getting the min candidates down to fitting in one vector.
Upvotes: 5