I'd like to compute the norm of a vector stored in a __m256d variable.
To do so, I implemented the ymmnorm function, which stores the result in a __m256d variable:
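// ymmdot is defined below; declare it first so ymmnorm compiles
__m256d ymmdot(__m256d const x, __m256d const y);

__m256d ymmnorm(__m256d const x)
{
    return _mm256_sqrt_pd(ymmdot(x, x));
}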
It relies on the dot-product function suggested here:
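__m256d ymmdot(__m256d const x, __m256d const y)
{
    __m256d xy = _mm256_mul_pd(x, y);
    // Horizontal add within each 128-bit lane: [p0+p1, p0+p1, p2+p3, p2+p3]
    __m256d temp = _mm256_hadd_pd(xy, xy);
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    // Both elements now hold the full dot product
    __m128d dotproduct = _mm_add_pd(_mm256_castpd256_pd128(temp), hi128);
    // Copy the 128-bit result into both halves of the 256-bit register
    return _mm256_broadcast_pd(&dotproduct);
}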
However, I am a newbie in the SIMD/AVX world, so I am wondering: is there a smarter or faster way to compute the norm of a vector held in a 256-bit variable?
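To check the question's code, here is a minimal harness, assuming GCC or Clang on an AVX-capable x86-64 CPU. The `__attribute__((target("avx")))` annotation is compiler-specific and stands in for compiling with `-mavx`; `ymmnorm_array` is a hypothetical wrapper that keeps `__m256d` values out of callers that are not compiled for AVX. The sample inputs have a squared norm of 25, so the result 5.0 is exact and can be compared with `==`.

```c
#include <immintrin.h>
#include <assert.h>

/* GCC/Clang-specific: compile these functions for AVX, like -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* The question's dot product: every lane of the result holds x.y. */
AVX_FN static __m256d ymmdot(__m256d const x, __m256d const y)
{
    __m256d xy = _mm256_mul_pd(x, y);               /* [p0, p1, p2, p3]               */
    __m256d temp = _mm256_hadd_pd(xy, xy);          /* [p0+p1, p0+p1, p2+p3, p2+p3]   */
    __m128d hi128 = _mm256_extractf128_pd(temp, 1); /* [p2+p3, p2+p3]                 */
    __m128d dotproduct = _mm_add_pd(_mm256_castpd256_pd128(temp), hi128);
    return _mm256_broadcast_pd(&dotproduct);        /* same sum in both 128-bit halves */
}

/* The question's norm: sqrt of the dot product in all four lanes. */
AVX_FN static __m256d ymmnorm(__m256d const x)
{
    return _mm256_sqrt_pd(ymmdot(x, x));
}

/* Hypothetical wrapper: load four doubles, store all four result lanes. */
AVX_FN static void ymmnorm_array(const double in[4], double out[4])
{
    _mm256_storeu_pd(out, ymmnorm(_mm256_loadu_pd(in)));
}
```

Calling `ymmnorm_array` on `{1, 2, 2, 4}` should fill all four output elements with 5.0.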
Assuming you need that exact prototype, I would do it like this:
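__m256d ymmnorm( __m256d x )
{
    const __m256d x2 = _mm256_mul_pd( x, x );
    // Add the upper 128-bit half to the lower one: [ s0 + s2, s1 + s3 ]
    __m128d vec16 = _mm_add_pd( _mm256_castpd256_pd128( x2 ), _mm256_extractf128_pd( x2, 1 ) );
    // Add the two remaining doubles; the total lands in the low element
    vec16 = _mm_add_sd( vec16, _mm_unpackhi_pd( vec16, vec16 ) );
    // Square root of the low element only
    vec16 = _mm_sqrt_sd( vec16, vec16 );
    // Broadcast the low element to all four lanes (AVX2)
    return _mm256_broadcastsd_pd( vec16 );
}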
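If the prototype can change, the caller often needs only a scalar. Here is a sketch of that variant, assuming GCC or Clang on an AVX-capable x86-64 CPU (the `target` attribute is compiler-specific). It follows the same reduction as above but finishes with `_mm_cvtsd_f64` to return a plain `double`, so it needs only AVX rather than AVX2. The names `ymmnorm_scalar` and `norm4` are hypothetical.

```c
#include <immintrin.h>
#include <assert.h>

/* GCC/Clang-specific: compile these functions for AVX, like -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical scalar-returning variant: sqrt(x0^2 + x1^2 + x2^2 + x3^2). */
AVX_FN static double ymmnorm_scalar(__m256d x)
{
    const __m256d x2 = _mm256_mul_pd(x, x);                /* [s0, s1, s2, s3]     */
    __m128d v = _mm_add_pd(_mm256_castpd256_pd128(x2),
                           _mm256_extractf128_pd(x2, 1));  /* [s0+s2, s1+s3]       */
    v = _mm_add_sd(v, _mm_unpackhi_pd(v, v));              /* low = total          */
    v = _mm_sqrt_sd(v, v);                                 /* low = sqrt(total)    */
    return _mm_cvtsd_f64(v);                               /* extract the low double */
}

/* Hypothetical convenience wrapper reading four doubles from memory. */
AVX_FN static double norm4(const double v[4])
{
    return ymmnorm_scalar(_mm256_loadu_pd(v));
}
```

Returning a `double` skips the final broadcast entirely; if a broadcast vector is needed later, it can still be built from the scalar.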
Here’s an alternative (it needs AVX2 for _mm256_permute4x64_pd), but I’d expect the first one to be slightly faster on most processors.
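__m256d ymmnorm( __m256d x )
{
    __m256d x2 = _mm256_mul_pd( x, x );
    // Swap the 128-bit halves (AVX2): [ s2, s3, s0, s1 ]
    __m256d tmp = _mm256_permute4x64_pd( x2, _MM_SHUFFLE( 1, 0, 3, 2 ) );
    x2 = _mm256_add_pd( x2, tmp );
    // Swap the two doubles within each 128-bit lane; the immediate has one
    // selector bit per element, so the swap is 0b0101 = 0x5
    tmp = _mm256_permute_pd( x2, 0x5 );
    x2 = _mm256_add_pd( x2, tmp );
    return _mm256_sqrt_pd( x2 );
}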
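A harness for the permute-based version, assuming GCC or Clang and an AVX2-capable x86-64 CPU. It writes the in-lane swap as immediate `0x5`: `_mm256_permute_pd` takes one selector bit per element, so `_MM_SHUFFLE2( 0, 1 )` (which evaluates to 1) swaps only the lower lane and leaves element 2 holding 2 * (s0 + s2) instead of the total. Any input where s0 + s2 differs from s1 + s3, such as {1, 2, 2, 4}, exposes that. The names `ymmnorm_v2` and `ymmnorm_v2_array` are hypothetical.

```c
#include <immintrin.h>
#include <assert.h>

/* GCC/Clang-specific: compile these functions for AVX2, like -mavx2. */
#define AVX2_FN __attribute__((target("avx2")))

AVX2_FN static __m256d ymmnorm_v2(__m256d x)
{
    __m256d x2 = _mm256_mul_pd(x, x);                    /* [s0, s1, s2, s3]              */
    /* Swap the 128-bit halves: [s2, s3, s0, s1] (AVX2). */
    __m256d tmp = _mm256_permute4x64_pd(x2, _MM_SHUFFLE(1, 0, 3, 2));
    x2 = _mm256_add_pd(x2, tmp);                         /* [a, b, a, b], a = s0+s2, b = s1+s3 */
    /* Swap within each 128-bit lane: selector bits 0b0101. */
    tmp = _mm256_permute_pd(x2, 0x5);                    /* [b, a, b, a]                  */
    x2 = _mm256_add_pd(x2, tmp);                         /* a + b in every lane           */
    return _mm256_sqrt_pd(x2);
}

/* Hypothetical wrapper: load four doubles, store all four result lanes. */
AVX2_FN static void ymmnorm_v2_array(const double in[4], double out[4])
{
    _mm256_storeu_pd(out, ymmnorm_v2(_mm256_loadu_pd(in)));
}
```

With the corrected immediate, every output lane for `{1, 2, 2, 4}` should equal 5.0; with `_MM_SHUFFLE2( 0, 1 )`, element 2 would instead be sqrt(10).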