Luigi Castelli
Luigi Castelli

Reputation: 676

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values.
I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value;

My attempts all ended up using a lot of shuffling of the vector elements, making the code not very elegant nor efficient. Also, I found it impossible to stay only in the AVX domain. At some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved wrong on this last statement.

So the ideal solution will:
1) only use only AVX instructions.
2) minimize the number of instructions. (I am hoping for no more than 3-4 instructions)

Having said that, any elegant/efficient solution will be accepted, even if it doesn't adhere to the above guidelines.

Thanks for any help.

-Luigi

Upvotes: 19

Views: 11724

Answers (4)

Alex Jorgenson
Alex Jorgenson

Reputation: 912

This doesn't specifically answer your question since you are using doubles, but here is the code I used to get the maximum of 8 single values. It is built off of the answer by @celion and @Norbert P.

#define HORIZONTAL_MAX_256(ymmA, result) \
                                                            /*      [upper   |   lower]                                                                         */ \
                                                            /*      [7 6 5 4 | 3 2 1 0]                                                                         */ \
    __m256 v1 = ymmA;                                       /* v1 = [H G F E | D C B A]                                                                         */ \
    __m256 v2 = _mm256_permute_ps(v1, 0b10'11'00'01);       /* v2 = [G H E F | C D A B]                                                                         */ \
    __m256 v3 = _mm256_max_ps(v1, v2);                      /* v3 = [W=max(G,H) W=max(G,H) Z=max(E,F) Z=max(E,F) | Y=max(C,D) Y=max(C,D) X=max(A,B) X=max(A,B)] */ \
                                                            /* v3 = [W W Z Z | Y Y X X]                                                                         */ \
    __m256 v4 = _mm256_permute_ps(v3, 0b00'00'10'10);       /* v4 = [Z Z W W | X X Y Y]                                                                         */ \
    __m256 v5 = _mm256_max_ps(v3, v4);                      /* v5 = [J=max(Z,W) J=max(Z,W) J=max(Z,W) J=max(Z,W) | I=max(X,Y) I=max(X,Y) I=max(X,Y) I=max(X,Y)] */ \
                                                            /* v5 = [J J J J | I I I I]                                                                         */ \
    __m128 v6 = _mm256_extractf128_ps(v5, 1);               /* v6 = [- - - - | J J J J]                                                                         */ \
    __m128 v7 = _mm_max_ps(_mm256_castps256_ps128(v5), v6); /* v7 = [- - - - | M=max(I,J) M=max(I,J) M=max(I,J) M=max(I,J)]                                     */ \
                                                            /* v7 = [- - - - | M M M M]                                                                         */ \
                                                            /* M = max(I,J)                                                                                     */ \
                                                            /* M = max(max(X,Y),max(Z,W))                                                                       */ \
                                                            /* M = max(max(max(A,B),max(C,D)),max(max(E,F),max(G,H)))                                           */ \
    _mm_store_ss(&result, v7);

edit

Using the VCL2 (Vector Class Library 2) lib, it seems to produce assembly code that is similar to what Peter Cordes is talking about in the comments. Here is the assembly code that VCL2 generated for my project:

vextractf128    xmm1, ymm0, 1     # ymm0 is the register to find the min of
vmaxps          xmm1, xmm0, xmm1
vpermilpd       xmm2, xmm1, 3
vmaxps          xmm1, xmm2, xmm1
vpsrldq         xmm2, xmm1, 4
vmaxps          xmm1, xmm2, xmm1
vbroadcastss    ymm2, xmm1        # This would be the save, it is setting up for the next instructions specific to my code

Upvotes: 0

joyx
joyx

Reputation: 77

//Use the code to find the horizontal maximum
__m256 v1 = initial_vector;//example v1=[1 2 3 4 5 6 7 8]
__m256 v2 = _mm256_permute_ps(v1,(int)147);//147 is control code for rotate left by upper 4 elements and lower 4 elements separately v2=[2 3 4 1 6 7 8 5]
__m256 v3 = _mm256_max_ps(v1,v2);//v3=[2 3 4 4 6 7 8 8]
__m256 v4 = _mm256_permute_ps(v3,(int)147);//v4=[3 4 4 2 7 8 8 6]
__m256 v5 = _mm256_max_ps(v3,v4);//v5=[3 4 4 4 7 8 8 8]
__m256 v6 = _mm256_permute_ps(v5,(int)147);//v6=[4 4 4 3 8 8 8 7]
__m256 v7 = _mm256_max_ps(v5,v6);//contains max of upper four elements and lower 4 elements. v7=[4 4 4 4 8 8 8 8]

//to get max of this horizontal array. Note that the highest end of either upper or lower can contain the maximum
float ALIGN max_array[8];
float horizontal_max;
_mm256_store_ps(max_array, v7);
if(max_array[3] > max_array[7])
{
    horizontal_max = max_array[3];
}
else
{
    horizontal_max = max_array[7];
}

Upvotes: 3

Norbert P.
Norbert P.

Reputation: 2807

I don't think you can do much better than 4 instructions: 2 shuffles and 2 comparisons.

__m256d x = ...; // input

__m128d y = _mm256_extractf128_pd(x, 1); // extract x[2], and x[3]
__m128d m1 = _mm_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3])
__m128d m2 = _mm_permute_pd(m1, 1); // set m2[0] = m1[1], m2[1] = m1[0]
__m128d m = _mm_max_pd(m1, m2); // both m[0] and m[1] contain the horizontal max(x[0], x[1], x[2], x[3])

Trivial modification to only work with 256-bit vectors:

__m256d x = ...; // input

__m256d y = _mm256_permute2f128_pd(x, x, 1); // permute 128-bit values
__m256d m1 = _mm256_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3]), etc.
__m256d m2 = _mm256_permute_pd(m1, 5); // set m2[0] = m1[1], m2[1] = m1[0], etc.
__m256d m = _mm256_max_pd(m1, m2); // all m[0] ... m[3] contain the horizontal max(x[0], x[1], x[2], x[3])

(untested)

Upvotes: 24

celion
celion

Reputation: 4004

The general way of doing this for a vector v1 = [A, B, C, D] is

  1. Permute v1 to v2 = [C, D, A, B] (swap 0th and 2nd elements, and 1st and 3rd ones)
  2. Take the max; i.e. v3 = max(v1,v2). You now have [max(A,C), max(B,D), max(A,C), max(B,D)]
  3. Permute v3 to v4, swapping the 0th and 1st elements, and the 2nd and 3rd ones.
  4. Take the max again, i.e. v5 = max(v3,v4). Now v5 contains the horizontal max in all of its components.

Specifically for AVX, the permutations can be done with _mm256_permute_pd and the maximums can be done with _mm256_max_pd. I don't have the exact permute masks handy but they should be pretty straightforward to figure out.

Hope that helps.

Upvotes: 13

Related Questions