Terrordrone

Reputation: 139

AVX-512 floating point comparison and masking

I'm not too familiar with SIMD, but I wrote some very simple stuff with AVX earlier. Now I would like to port some of that old AVX code to AVX-512.

What I intend to do:

// SIZE, LOW_THRESHOLD, HIGH_THRESHOLD and array are defined
// the code works with float data

for ( int index = 0; index < SIZE; ++index )
{
    array[ index ] = LOW_THRESHOLD < array[ index ] && array[ index ] < HIGH_THRESHOLD ? 1.0f : 0.0f;
}

What I've done with AVX:

const __m256 lowThreshold  = _mm256_set1_ps( LOW_THRESHOLD );
const __m256 highThreshold = _mm256_set1_ps( HIGH_THRESHOLD );
const __m256 trueValue     = _mm256_set1_ps( 1.0f );
const __m256 falseValue    = _mm256_set1_ps( 0.0f );

for ( int index = 0; index < SIZE; index += 8 )
{
    // aligned load
    const __m256 val = _mm256_load_ps( array + index );
    // compare
    const __m256 comp1 = _mm256_cmp_ps( lowThreshold, val , _CMP_LT_OQ );
    const __m256 comp2 = _mm256_cmp_ps( val , highThreshold, _CMP_LT_OQ );
    // AND
    const __m256 mask = _mm256_and_ps( comp1, comp2 );
    // blend
    const __m256 result = _mm256_blendv_ps( falseValue, trueValue, mask );
    // aligned store
    _mm256_store_ps( array + index, result );
}

Now I'm stuck at AVX-512.

const __m512 lowThreshold  = _mm512_set1_ps( LOW_THRESHOLD );
const __m512 highThreshold = _mm512_set1_ps( HIGH_THRESHOLD );
const __m512 trueValue     = _mm512_set1_ps( 1.0f );
const __m512 falseValue    = _mm512_set1_ps( 0.0f );

for ( int index = 0; index < SIZE; index += 16 )
{
    // aligned load
    const __m512 val = _mm512_load_ps( array + index );

    // the result of the comparison goes into a mask?
    const __mmask16 comp1 = _mm512_cmplt_ps_mask( lowThreshold, val );
    const __mmask16 comp2 = _mm512_cmplt_ps_mask( val, highThreshold );

    // how to use these masks?
}

It would be nice to use __m512 _mm512_and_ps (__m512 a, __m512 b), but the comparisons only give me __mmask16 values, and I couldn't find an _mm512 comparison that returns a vector the way _mm256_cmp_ps does. This is probably an easy issue for more experienced AVX users. Thanks!

Upvotes: 2

Views: 1403

Answers (1)

Arthur Woimbée

Reputation: 114

If you look at the type definition of __mmask16, you'll see: typedef unsigned short __mmask16;. So you can treat this type like uint16_t and combine the two masks with '&'. Then you can use __m512 _mm512_mask_blend_ps (__mmask16 k, __m512 a, __m512 b), which takes each element from b where the corresponding mask bit is set and from a otherwise.

This works perfectly for me:

#include <immintrin.h>
#include <stdio.h>

#define LOW_THRESHOLD 1
#define HIGH_THRESHOLD 3

int main() {
    __m512 lowThreshold  = _mm512_set1_ps( LOW_THRESHOLD );
    __m512 highThreshold = _mm512_set1_ps( HIGH_THRESHOLD );
    __m512 trueValue     = _mm512_set1_ps( 1.0f );
    __m512 falseValue    = _mm512_set1_ps( 0.0f );

    float array[16] = {-5.0f,6.0f,4.0f,1.5f,0.7f,1.0f,1.0f,-5.0f,6.0f,4.0f,1.5f,0.7f,1.0f,1.0f,-5.0f,6.0f};

    for (int i = 0; i < 16; i += 1) {
        printf("%5.1f ", array[i]);
    }
    printf("\n");

    for ( int index = 0; index < 16; index += 16 )
    {
        __m512 val = _mm512_loadu_ps( array + index );
        __mmask16 comp1 = _mm512_cmplt_ps_mask( lowThreshold, val );
        __mmask16 comp2 = _mm512_cmplt_ps_mask( val, highThreshold );
        __mmask16 mask = comp1 & comp2;
        __m512 result = _mm512_mask_blend_ps(mask, falseValue, trueValue);
        _mm512_storeu_ps( array + index, result );
    }
    for (int i = 0; i < 16; i += 1) {
        printf("%5.1f ", array[i]);
    }
    printf("\n");
}

As a further optimization, you can use a zero-masked compare-into-mask to AND them for free, like _mm512_mask_cmplt_ps_mask(comp1, val, highThreshold). A mask bit is 1 only where comp1 was set and the FP compare is true.
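A minimal sketch of that masked compare, written out in source rather than left to the compiler. The helper name threshold_mask_cmp and its signature are made up for illustration; the AVX-512 path assumes n is a multiple of 16 and only compiles when AVX-512F is enabled (for example with -mavx512f), otherwise a scalar loop with the same semantics is used:

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative only).
   Writes 1.0f where low < x < high and 0.0f elsewhere.
   The AVX-512 path assumes n is a multiple of 16. */
void threshold_mask_cmp(float *data, size_t n, float low, float high)
{
#if defined(__AVX512F__)
    const __m512 lowv  = _mm512_set1_ps(low);
    const __m512 highv = _mm512_set1_ps(high);
    const __m512 one   = _mm512_set1_ps(1.0f);
    const __m512 zero  = _mm512_setzero_ps();

    for (size_t i = 0; i < n; i += 16) {
        const __m512 val = _mm512_loadu_ps(data + i);
        /* first compare: bit set where low < val */
        __mmask16 k = _mm512_cmplt_ps_mask(lowv, val);
        /* second compare under mask k: a bit stays set only where
           k was already set AND val < high, so no separate '&' */
        k = _mm512_mask_cmplt_ps_mask(k, val, highv);
        /* take 'one' where the mask bit is set, 'zero' otherwise */
        _mm512_storeu_ps(data + i, _mm512_mask_blend_ps(k, zero, one));
    }
#else
    /* scalar fallback when AVX-512F is not enabled at compile time */
    for (size_t i = 0; i < n; ++i)
        data[i] = (low < data[i] && data[i] < high) ? 1.0f : 0.0f;
#endif
}
```

Calling threshold_mask_cmp(array, 16, 1.0f, 3.0f) on the 16-element array from the example above should behave the same as the '&' version.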

Clang already optimizes the above code to do that; note the vcmpltps k1 {k1}, zmm0, dword ptr [rip + .LCPI0_1]{1to16} in the asm output (Godbolt). (It's rewriting the same mask register. And the memory source operand is a broadcast-load.)

falseValue is 0.0, so zero-masking can produce it instead of actually needing a constant in a register to blend from. But there aren't zero-masked stores, only merge-masking, so if we're just storing the result, we still need a separate instruction to produce the blend result in a register.

But it can just be a zero-masking move, like the vmovaps zmm0 {k1} {z}, zmm1 clang uses. So clang's already doing this optimization, as if you'd used _mm512_maskz_mov_ps(mask, _mm512_set1_ps(1.0f)) in the source. That avoids needing a set1(0.0) constant in a register, and might be cheaper than a full blend on some CPUs.
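For reference, here is a sketch of how the zero-masking move could look when written by hand. The helper name threshold_maskz and its signature are hypothetical; as before, the AVX-512 path assumes n is a multiple of 16 and is only compiled when AVX-512F is enabled, with a scalar fallback otherwise:

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative only).
   Same result as the blend version, but without a 0.0f constant:
   the zero-masking move clears every element whose mask bit is 0.
   The AVX-512 path assumes n is a multiple of 16. */
void threshold_maskz(float *data, size_t n, float low, float high)
{
#if defined(__AVX512F__)
    const __m512 lowv  = _mm512_set1_ps(low);
    const __m512 highv = _mm512_set1_ps(high);
    const __m512 one   = _mm512_set1_ps(1.0f);

    for (size_t i = 0; i < n; i += 16) {
        const __m512 val = _mm512_loadu_ps(data + i);
        __mmask16 k = _mm512_cmplt_ps_mask(lowv, val);
        k = _mm512_mask_cmplt_ps_mask(k, val, highv);
        /* 1.0f where the mask bit is set, 0.0f elsewhere */
        _mm512_storeu_ps(data + i, _mm512_maskz_mov_ps(k, one));
    }
#else
    /* scalar fallback when AVX-512F is not enabled at compile time */
    for (size_t i = 0; i < n; ++i)
        data[i] = (low < data[i] && data[i] < high) ? 1.0f : 0.0f;
#endif
}
```

Whether this beats the blend depends on the compiler and CPU; since clang and GCC already make this transformation themselves, the hand-written form mainly matters for compilers that don't.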

(GCC makes the same optimizations as clang in this case, but MSVC uses an actual kandw instruction, and vxorps + vblendmps to blend.)

Upvotes: 4
