CaG

Reputation: 75

How to compute the norm of a 256-bit variable using Intel AVX

I'd like to compute the norm of a vector stored in a __m256d variable.
To do so, I implemented the ymmnorm function, which returns the result in a __m256d variable:

__m256d ymmnorm(__m256d const x)
{
    return _mm256_sqrt_pd(ymmdot(x, x));
}

It uses the dot product function suggested here:

__m256d ymmdot(__m256d const x, __m256d const y)
{
    __m256d xy = _mm256_mul_pd(x, y);
    __m256d temp = _mm256_hadd_pd(xy, xy);
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    __m128d dotproduct = _mm_add_pd(_mm256_castpd256_pd128(temp), hi128);

    return _mm256_broadcast_pd(&dotproduct);
}

However, I am a newbie in the SIMD/AVX world, so I am wondering: is there a smarter or better way to compute the norm of a vector held in a 256-bit variable?

Upvotes: 1

Views: 439

Answers (1)

Soonts

Reputation: 21936

Assuming you need that exact prototype, I would do it like this:

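__m256d ymmnorm( __m256d x )
{
    const __m256d x2 = _mm256_mul_pd( x, x );
    // Add the upper 128-bit half to the lower half
    __m128d vec16 = _mm_add_pd( _mm256_castpd256_pd128( x2 ), _mm256_extractf128_pd( x2, 1 ) );
    // Add the two remaining lanes; the sum of squares ends up in the low lane
    vec16 = _mm_add_sd( vec16, _mm_unpackhi_pd( vec16, vec16 ) );
    vec16 = _mm_sqrt_sd( vec16, vec16 );
    // Broadcast the low lane to all four lanes (this register form requires AVX2)
    return _mm256_broadcastsd_pd( vec16 );
}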
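
Since the question mentions plain AVX, here is a hedged sketch of the same reduction that avoids the AVX2 broadcast by going through `_mm256_set1_pd`. The names `ymmnorm_avx1`, `norm4_avx` and `norm4_lanes` are hypothetical and exist only for this illustration, and the `target` attribute plus `__builtin_cpu_supports` assume GCC or Clang. The wrapper reports whether AVX was usable instead of crashing on a machine without it.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical AVX-only variant: every lane of the result holds
   sqrt(x0^2 + x1^2 + x2^2 + x3^2). The target attribute (GCC/Clang)
   enables AVX for this function without a global -mavx flag. */
__attribute__((target("avx")))
static __m256d ymmnorm_avx1(__m256d x)
{
    const __m256d x2 = _mm256_mul_pd(x, x);
    /* Add the upper 128-bit half to the lower half: two partial sums. */
    __m128d sum = _mm_add_pd(_mm256_castpd256_pd128(x2),
                             _mm256_extractf128_pd(x2, 1));
    /* Add the two partial sums; the total lands in the low lane. */
    sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
    sum = _mm_sqrt_sd(sum, sum);
    /* Broadcast without AVX2: extract the scalar, then fill all lanes. */
    return _mm256_set1_pd(_mm_cvtsd_f64(sum));
}

/* Load four doubles, compute the norm, store all four result lanes. */
__attribute__((target("avx")))
static void norm4_avx(const double v[4], double out[4])
{
    _mm256_storeu_pd(out, ymmnorm_avx1(_mm256_loadu_pd(v)));
}
#endif

/* Returns 1 and fills out[0..3] when AVX is usable, 0 otherwise. */
static int norm4_lanes(const double v[4], double out[4])
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        norm4_avx(v, out);
        return 1;
    }
#endif
    (void)v;
    (void)out;
    return 0;
}
```

Calling `norm4_lanes` with the vector {1, 2, 2, 4} should fill every element of `out` with the same norm when the function returns 1. Whether `_mm256_set1_pd` is faster or slower than the AVX2 broadcast depends on the compiler and CPU, so treat this as a starting point for measurement.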

Here’s an alternative, but I’d expect the first one to be slightly faster on most processors.

__m256d ymmnorm( __m256d x )
{
    __m256d x2 = _mm256_mul_pd( x, x );
    // Swap the 128-bit halves (AVX2) and add
    __m256d tmp = _mm256_permute4x64_pd( x2, _MM_SHUFFLE( 1, 0, 3, 2 ) );
    x2 = _mm256_add_pd( x2, tmp );
    // Swap the two lanes within each 128-bit half (immediate 0b0101) and add
    tmp = _mm256_permute_pd( x2, 5 );
    x2 = _mm256_add_pd( x2, tmp );
    return _mm256_sqrt_pd( x2 );
}
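
To make the data flow of this alternative easier to follow, here is a small scalar model of its two swap-and-add steps, written with plain arrays instead of intrinsics. The function name `hsum_model` is hypothetical, and the model covers only the horizontal sum of the squared lanes, not the final square root.

```c
/* Scalar model of the two swap-and-add steps on the squared lanes.
   After both steps every element of out[] holds sq[0]+sq[1]+sq[2]+sq[3]. */
static void hsum_model(const double sq[4], double out[4])
{
    double s[4];
    int i;

    /* Step 1: pair each lane with the lane in the other 128-bit half
       (index i ^ 2 maps 0<->2 and 1<->3), then add. */
    for (i = 0; i < 4; ++i)
        s[i] = sq[i] + sq[i ^ 2];

    /* Step 2: pair each lane with its neighbour inside the same half
       (index i ^ 1 maps 0<->1 and 2<->3), then add. */
    for (i = 0; i < 4; ++i)
        out[i] = s[i] + s[i ^ 1];
}
```

Because all four lanes already hold the full sum after the second step, the vector version can take a packed square root and return the result directly, with no separate broadcast.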

Upvotes: 1
