Reputation: 676
I have a __m256d vector packed with four 64-bit floating-point values.
I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value;
My attempts all ended up using a lot of shuffling of the vector elements, making the code not very elegant nor efficient. Also, I found it impossible to stay only in the AVX domain. At some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved wrong on this last statement.
So the ideal solution will:
1) only use only AVX instructions.
2) minimize the number of instructions. (I am hoping for no more than 3-4 instructions)
Having said that, any elegant/efficient solution will be accepted, even if it doesn't adhere to the above guidelines.
Thanks for any help.
-Luigi
Upvotes: 19
Views: 11724
Reputation: 912
This doesn't specifically answer your question since you are using doubles, but here is the code I used to get the maximum of 8 single values. It is built off of the answer by @celion and @Norbert P.
#define HORIZONTAL_MAX_256(ymmA, result) \
/* [upper | lower] */ \
/* [7 6 5 4 | 3 2 1 0] */ \
__m256 v1 = ymmA; /* v1 = [H G F E | D C B A] */ \
__m256 v2 = _mm256_permute_ps(v1, 0b10'11'00'01); /* v2 = [G H E F | C D A B] */ \
__m256 v3 = _mm256_max_ps(v1, v2); /* v3 = [W=max(G,H) W=max(G,H) Z=max(E,F) Z=max(E,F) | Y=max(C,D) Y=max(C,D) X=max(A,B) X=max(A,B)] */ \
/* v3 = [W W Z Z | Y Y X X] */ \
__m256 v4 = _mm256_permute_ps(v3, 0b00'00'10'10); /* v4 = [Z Z W W | X X Y Y] */ \
__m256 v5 = _mm256_max_ps(v3, v4); /* v5 = [J=max(Z,W) J=max(Z,W) J=max(Z,W) J=max(Z,W) | I=max(X,Y) I=max(X,Y) I=max(X,Y) I=max(X,Y)] */ \
/* v5 = [J J J J | I I I I] */ \
__m128 v6 = _mm256_extractf128_ps(v5, 1); /* v6 = [- - - - | J J J J] */ \
__m128 v7 = _mm_max_ps(_mm256_castps256_ps128(v5), v6); /* v7 = [- - - - | M=max(I,J) M=max(I,J) M=max(I,J) M=max(I,J)] */ \
/* v7 = [- - - - | M M M M] */ \
/* M = max(I,J) */ \
/* M = max(max(X,Y),max(Z,W)) */ \
/* M = max(max(max(A,B),max(C,D)),max(max(E,F),max(G,H))) */ \
_mm_store_ss(&result, v7);
edit
Using the VCL2 (Vector Class Library 2) lib, it seems to produce assembly code that is similar to what Peter Cordes is talking about in the comments. Here is the assembly code that VCL2 generated for my project:
vextractf128 xmm1, ymm0, 1 # ymm0 is the register to find the min of
vmaxps xmm1, xmm0, xmm1
vpermilpd xmm2, xmm1, 3
vmaxps xmm1, xmm2, xmm1
vpsrldq xmm2, xmm1, 4
vmaxps xmm1, xmm2, xmm1
vbroadcastss ymm2, xmm1 # This would be the save, it is setting up for the next instructions specific to my code
Upvotes: 0
Reputation: 77
//Use the code to find the horizontal maximum
__m256 v1 = initial_vector;//example v1=[1 2 3 4 5 6 7 8]
__m256 v2 = _mm256_permute_ps(v1,(int)147);//147 is control code for rotate left by upper 4 elements and lower 4 elements separately v2=[2 3 4 1 6 7 8 5]
__m256 v3 = _mm256_max_ps(v1,v2);//v3=[2 3 4 4 6 7 8 8]
__m256 v4 = _mm256_permute_ps(v3,(int)147);//v4=[3 4 4 2 7 8 8 6]
__m256 v5 = _mm256_max_ps(v3,v4);//v5=[3 4 4 4 7 8 8 8]
__m256 v6 = _mm256_permute_ps(v5,(int)147);//v6=[4 4 4 3 8 8 8 7]
__m256 v7 = _mm256_max_ps(v5,v6);//contains max of upper four elements and lower 4 elements. v7=[4 4 4 4 8 8 8 8]
//to get max of this horizontal array. Note that the highest end of either upper or lower can contain the maximum
float ALIGN max_array[8];
float horizontal_max;
_mm256_store_ps(max_array, v7);
if(max_array[3] > max_array[7])
{
horizontal_max = max_array[3];
}
else
{
horizontal_max = max_array[7];
}
Upvotes: 3
Reputation: 2807
I don't think you can do much better than 4 instructions: 2 shuffles and 2 comparisons.
__m256d x = ...; // input
__m128d y = _mm256_extractf128_pd(x, 1); // extract x[2], and x[3]
__m128d m1 = _mm_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3])
__m128d m2 = _mm_permute_pd(m1, 1); // set m2[0] = m1[1], m2[1] = m1[0]
__m128d m = _mm_max_pd(m1, m2); // both m[0] and m[1] contain the horizontal max(x[0], x[1], x[2], x[3])
Trivial modification to only work with 256-bit vectors:
__m256d x = ...; // input
__m256d y = _mm256_permute2f128_pd(x, x, 1); // permute 128-bit values
__m256d m1 = _mm256_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3]), etc.
__m256d m2 = _mm256_permute_pd(m1, 5); // set m2[0] = m1[1], m2[1] = m1[0], etc.
__m256d m = _mm256_max_pd(m1, m2); // all m[0] ... m[3] contain the horizontal max(x[0], x[1], x[2], x[3])
(untested)
Upvotes: 24
Reputation: 4004
The general way of doing this for a vector v1 = [A, B, C, D]
is
v1
to v2 = [C, D, A, B]
(swap 0th and 2nd elements, and 1st and 3rd ones)v3 = max(v1,v2)
. You now have [max(A,C), max(B,D), max(A,C), max(B,D)]
v3
to v4
, swapping the 0th and 1st elements, and the 2nd and 3rd ones.v5 = max(v3,v4)
. Now v5
contains the horizontal max in all of its components.Specifically for AVX, the permutations can be done with _mm256_permute_pd
and the maximums can be done with _mm256_max_pd
. I don't have the exact permute masks handy but they should be pretty straightforward to figure out.
Hope that helps.
Upvotes: 13