For some reason there is no _mm256_rcp_pd in AVX or AVX2; only the single-precision _mm256_rcp_ps exists. AVX-512 finally added _mm256_rcp14_pd (the 256-bit form requires AVX-512VL).
Is there a way to get a fast approximate reciprocal in double precision on AVX2? Are we supposed to convert to single precision and then back?
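The convert-to-single-precision route the question mentions can be sketched as follows. This is an illustration under stated assumptions, not a tuned implementation: rcp_via_ps and rcp_via_ps_array are hypothetical names, every |y| is assumed to lie well inside float's normal range (roughly 1e-37 to 1e37) so that neither the conversion nor the reciprocal overflows or underflows, and the GCC/Clang target attribute is used so the file compiles without -mavx2 -mfma (drop it for MSVC). The ~12-bit rcpps estimate is refined by one Newton-Raphson step against the original double y, which roughly squares its relative error.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Approximate 1/y by going through float and rcpps, then refining in double.
// Assumption: every |y| is well inside float's normal range.
__attribute__((target("avx2,fma")))
static __m256d rcp_via_ps(__m256d y)
{
    __m128 yf = _mm256_cvtpd_ps(y);   // 4 doubles -> 4 floats (rounded)
    __m128 rf = _mm_rcp_ps(yf);       // estimate, rel. error < 1.5*2^-12
    __m256d x = _mm256_cvtps_pd(rf);  // widen the estimate back to double
    // One Newton-Raphson step against the double y: x = x*(2 - x*y)
    return _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));
}

// Hypothetical driver: applies rcp_via_ps to n doubles, n a multiple of 4.
__attribute__((target("avx2,fma")))
void rcp_via_ps_array(const double* in, double* out, std::size_t n)
{
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4)
        _mm256_storeu_pd(out + i, rcp_via_ps(_mm256_loadu_pd(in + i)));
}
```

This costs two conversions plus the rcpps on top of the refinement, so it is not obviously cheaper than a pure-double trick; its advantage is a much better starting estimate.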
With some integer-cast hacking and one Newton–Raphson refinement step, you can get a reasonably accurate approximation in 3 uops (an integer subtraction, an FMA and a multiply). Latency is probably not great, since this mixes integer and floating-point operations, which can cost extra bypass delay when values move between execution domains. But it should still be much better than divpd.
This solution also assumes that all inputs are normalized doubles.
#include <immintrin.h>

// Requires AVX2 + FMA (e.g. compile with -mavx2 -mfma) and C++14 for the
// digit separators. Assumes every lane of y is a normalized double.
__m256d fastinv(__m256d y)
{
    // Exact results for powers of two
    __m256i const magic = _mm256_set1_epi64x(0x7fe0'0000'0000'0000);
    // Bit-magic: for powers of two this just inverts the exponent,
    // and values in between are linearly interpolated
    __m256d x = _mm256_castsi256_pd(_mm256_sub_epi64(magic, _mm256_castpd_si256(y)));
    // Newton-Raphson refinement: x = x*(2.0 - x*y)
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, _mm256_set1_pd(2.0)));
    return x;
}
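To see what the integer subtraction does to a single lane, here is a scalar model of fastinv. It is an illustration only, and fastinv_scalar is a hypothetical name: the bits are reinterpreted with std::memcpy, and std::fma is used to reproduce the single rounding of _mm256_fnmadd_pd, so each lane should produce the same result as the vector code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar model of one 64-bit lane of fastinv (illustration only).
// Assumes y is a normalized double, like the vector version.
double fastinv_scalar(double y)
{
    assert(std::isnormal(y));
    std::uint64_t bits;
    std::memcpy(&bits, &y, sizeof bits);         // reinterpret double as integer
    std::uint64_t const magic = 0x7fe0'0000'0000'0000;
    std::uint64_t guess_bits = magic - bits;     // same as _mm256_sub_epi64
    double x;
    std::memcpy(&x, &guess_bits, sizeof x);      // reinterpret back to double
    // Newton-Raphson step: std::fma(-x, y, 2.0) computes 2.0 - x*y with a
    // single rounding, like _mm256_fnmadd_pd
    return x * std::fma(-x, y, 2.0);
}
```

For example, fastinv_scalar(1.5) starts from the guess 0.75 and refines it to 0.65625, against a true value of about 0.6667.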
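With the constants above, the result is exact for powers of two and slightly too small everywhere else: after the Newton–Raphson step the relative error is -(x0*y - 1)^2, where x0 is the bit-magic guess. It peaks at 1/64 ≈ 1.56% for y = 1.5·2^k (about 1.47% near sqrt(2)).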
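A short derivation of where the error comes from, worked for y in [1, 2) (other binades only shift the exponent of the guess by the same power of two, as long as nothing over- or underflows). Here m is the fraction encoded in the mantissa bits and t is the product of the initial guess with y; the borrow in the integer subtraction is what turns the guess into a linear function of y.

```latex
\begin{aligned}
y &= 1 + m, \quad m \in [0, 1) \\
x_0 &= \tfrac{1}{2}\bigl(1 + (1 - m)\bigr) = \tfrac{3 - y}{2} \\
t &= x_0\, y = \tfrac{y\,(3 - y)}{2} \\
x_1\, y &= t\,(2 - t) = 1 - (t - 1)^2 \\
\max_{y \in [1,2)} (t - 1) &= t\big|_{y = 3/2} - 1 = \tfrac{9}{8} - 1 = \tfrac{1}{8}
\;\Rightarrow\; x_1\, y - 1 \ge -\tfrac{1}{64} \approx -1.56\%
\end{aligned}
```

At y = 1 and y = 2 the guess is exact (t = 1), which is why powers of two come out exact after refinement.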
If you fine-tune the magic constant as well as the 2.0 constant, or add another NR step, you can increase the accuracy.
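Adding a second Newton-Raphson step with the original, untuned constants might look like the sketch below. Since each step squares the relative error, the worst case should drop from 1/64 to (1/8)^4 = 1/4096 ≈ 2.4e-4, at the cost of one more FMA and multiply on the dependency chain. fastinv_nr2 and fastinv_nr2_array are hypothetical names, the inputs are assumed to be normalized doubles, and the GCC/Clang target attribute stands in for -mavx2 -mfma.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Same bit trick as fastinv, followed by two Newton-Raphson steps.
// Constants are unchanged (not tuned). Assumes normalized inputs.
__attribute__((target("avx2,fma")))
static __m256d fastinv_nr2(__m256d y)
{
    __m256i const magic = _mm256_set1_epi64x(0x7fe0'0000'0000'0000);
    __m256d const two = _mm256_set1_pd(2.0);
    __m256d x = _mm256_castsi256_pd(_mm256_sub_epi64(magic, _mm256_castpd_si256(y)));
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, two));  // first NR step
    x = _mm256_mul_pd(x, _mm256_fnmadd_pd(x, y, two));  // second NR step
    return x;
}

// Hypothetical driver: applies fastinv_nr2 to n doubles, n a multiple of 4.
__attribute__((target("avx2,fma")))
void fastinv_nr2_array(const double* in, double* out, std::size_t n)
{
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4)
        _mm256_storeu_pd(out + i, fastinv_nr2(_mm256_loadu_pd(in + i)));
}
```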
Godbolt link: https://godbolt.org/z/f7YhnhT96