Reputation: 139
I'm not too familiar with SIMD, but I wrote some very simple stuff with AVX earlier. Now I would like to implement some old AVX codes with AVX-512 too.
What I intend to do:
// SIZE, LOW_THRESHOLD, HIGH_THRESHOLD and array are defined
// the code works with float data
for ( int index = 0; index < SIZE; ++index )
{
array[ index ] = LOW_THRESHOLD < array[ index ] && array[ index ] < HIGH_THRESHOLD ? 1.0f : 0.0f;
}
What I've done with AVX:
const __m256 lowThreshold = _mm256_set1_ps( LOW_THRESHOLD );
const __m256 highThreshold = _mm256_set1_ps( HIGH_THRESHOLD );
const __m256 trueValue = _mm256_set1_ps( 1.0f );
const __m256 falseValue = _mm256_set1_ps( 0.0f );
for ( int index = 0; index < SIZE; index += 8 )
{
// aligned load
const __m256 val = _mm256_load_ps( array + index );
// compare
const __m256 comp1 = _mm256_cmp_ps( lowThreshold, val , _CMP_LT_OQ );
const __m256 comp2 = _mm256_cmp_ps( val , highThreshold, _CMP_LT_OQ );
// AND
const __m256 mask = _mm256_and_ps( comp1, comp2 );
// blend
const __m256 result = _mm256_blendv_ps( falseValue, trueValue, mask );
// aligned store
_mm256_store_ps( array + index, result );
}
Now I'm stuck at AVX-512.
const __m512 lowThreshold = _mm512_set1_ps( LOW_THRESHOLD );
const __m512 highThreshold = _mm512_set1_ps( HIGH_THRESHOLD );
const __m512 trueValue = _mm512_set1_ps( 1.0f );
const __m512 falseValue = _mm512_set1_ps( 0.0f );
for ( int index = 0; index < SIZE; index += 16 )
{
// aligned load
const __m512 val = _mm512_load_ps( array + index );
// the result of the comparison goes into a mask?
const __mmask16 comp1 = _mm512_cmplt_ps_mask( lowThreshold, val );
const __mmask16 comp2 = _mm512_cmplt_ps_mask( val, highThreshold );
// how to use these masks?
}
It would be nice to use __m512 _mm512_and_ps (__m512 a, __m512 b)
, but there are only __mask16 variables after the comparison and I didn't find any _mm512 function such as _mm256_cmp_ps
. It is probably an easy issue for the more experienced AVX users.
Thanks!
Upvotes: 2
Views: 1403
Reputation: 114
If you look at the type definition of __mmask16
, you'll see: typedef unsigned short __mmask16;
. So you can treat this type like uint16_t and just use '&'.
Then you can use __m512 _mm512_mask_blend_ps (__mmask16 k, __m512 a, __m512 b)
.
This works perfectly for me:
#include <immintrin.h>
#include <stdio.h>
#define LOW_THRESHOLD 1
#define HIGH_THRESHOLD 3
int main() {
__m512 lowThreshold = _mm512_set1_ps( LOW_THRESHOLD );
__m512 highThreshold = _mm512_set1_ps( HIGH_THRESHOLD );
__m512 trueValue = _mm512_set1_ps( 1.0f );
__m512 falseValue = _mm512_set1_ps( 0.0f );
float array[16] = {-5.0f,6.0f,4.0f,1.5f,0.7f,1.0f,1.0f,-5.0f,6.0f,4.0f,1.5f,0.7f,1.0f,1.0f,-5.0f,6.0f};
for (int i = 0; i < 16; i += 1) {
printf("%5.1f ", array[i]);
}
printf("\n");
for ( int index = 0; index < 16; index += 16 )
{
__m512 val = _mm512_loadu_ps( array + index );
__mmask16 comp1 = _mm512_cmplt_ps_mask( lowThreshold, val );
__mmask16 comp2 = _mm512_cmplt_ps_mask( val, highThreshold );
__mmask16 mask = comp1 & comp2;
__m512 result = _mm512_mask_blend_ps(mask, falseValue, trueValue);
_mm512_storeu_ps( array + index, result );
}
for (int i = 0; i < 16; i += 1) {
printf("%5.1f ", array[i]);
}
printf("\n");
}
As a further optimization, you can use a zero-masked compare-into-mask to AND them for free, like _mm512_mask_cmplt_ps_mask(comp1, val, highThreshold)
. A mask bit is 1 only where comp1
was set and the FP compare is true.
Clang already optimizes the above code to do that; note the vcmpltps k1 {k1}, zmm0, dword ptr [rip + .LCPI0_1]{1to16}
in the asm output (Godbolt). (It's rewriting the same mask register. And the memory source operand is a broadcast-load.)
falseValue
is 0.0, so zero-masking can produce it instead of actually needing a constant in a register to blend from. But there aren't zero-masked stores, only merge-masking, so if we're just storing the result still need a separate instruction to produce the blend result in a register.
But it can just be a zero-masking move, like the vmovaps zmm0 {k1} {z}, zmm1
clang uses. So clang's already doing this optimization, as if you'd used _mm512_maskz_mov_ps(mask, _mm512_set1_ps(1.0f))
in the source. That avoids needing a set1(0.0)
constant in a register, and might be cheaper than a full blend on some CPUs.
(GCC makes the same optimizations as clang in this case, but MSVC uses an actual kandw
instruction, and vxorps
+ vblendmps
to blend.)
Upvotes: 4