I'm converting SSE2 sine and cosine functions (from Julien Pommier's sse_mathfun.h; based on the CEPHES sinf function) to use AVX in order to accept 8 float vectors or 4 doubles. So, Julien's function sin_ps becomes sin_ps8 (for 8 floats) and sin_pd4 for 4 doubles. (The "advanced" editor here fails to accept my code, so please visit http://arstechnica.com/civis/viewtopic.php?f=20&t=1227375 to see it.) Testing with clang 3.3 under Mac OS X 10.6.8 running on a 2011 Core2 i7 @ 2.7Ghz, benchmarking results look like this: sinf .. -> 27.7 millions of vector evaluations/second over 5.56e+07 iters (standard, scalar sinf() function) sin_ps .. -> 41.0 millions of vector evaluations/second over 8.22e+07 iters sin_pd4 .. -> 40.2 millions of vector evaluations/second over 8.06e+07 iters sin_ps8 .. -> 2.5 millions of vector evaluations/second over 5.1e+06 iters The cost of sin_ps8 is downright frightening, and it seems it is due to the use of _mm256_castsi256_ps . In fact, commenting out the line "poly_mask = _mm256_castsi256_ps(emmm2);" results in a more normal performance. sin_pd4 uses _mm_castsi128_pd, but it appears that is not (just) the mix of SSE and AVX instructions that is biting me in sin_ps8: when I emulate the _mm256_castsi256_ps calls with 2 calls to _mm_castsi128_ps, performance doesn't improve. emm2 and emm0 are pointers to emmm2 and emmm0, both v8si instances and thus (a priori) correctly aligned to 32 bits boundaries. See sse_mathfun.h and sse_mathfun_test.c for compilable code. Is there a(n easy) way to avoid the penalty I'm seeing?

Reputation: 776

converting SSE code to AVX - cost of _mm256_and_ps

I'm converting SSE2 sine and cosine functions (from Julien Pommier's sse_mathfun.h; based on the CEPHES sinf function) to use AVX in order to accept 8 float vectors or 4 doubles.

So, Julien's function sin_ps becomes sin_ps8 (for 8 floats) and sin_pd4 for 4 doubles. (The "advanced" editor here fails to accept my code, so please visit http://arstechnica.com/civis/viewtopic.php?f=20&t=1227375 to see it.)

Testing with clang 3.3 under Mac OS X 10.6.8 running on a 2011 Core2 i7 @ 2.7Ghz, benchmarking results look like this:

sinf .. -> 27.7 millions of vector evaluations/second over 5.56e+07 iters (standard, scalar sinf() function)

sin_ps .. -> 41.0 millions of vector evaluations/second over 8.22e+07 iters

sin_pd4 .. -> 40.2 millions of vector evaluations/second over 8.06e+07 iters

sin_ps8 .. -> 2.5 millions of vector evaluations/second over 5.1e+06 iters

The cost of sin_ps8 is downright frightening, and it seems it is due to the use of _mm256_castsi256_ps . In fact, commenting out the line "poly_mask = _mm256_castsi256_ps(emmm2);" results in a more normal performance. sin_pd4 uses _mm_castsi128_pd, but it appears that is not (just) the mix of SSE and AVX instructions that is biting me in sin_ps8: when I emulate the _mm256_castsi256_ps calls with 2 calls to _mm_castsi128_ps, performance doesn't improve. emm2 and emm0 are pointers to emmm2 and emmm0, both v8si instances and thus (a priori) correctly aligned to 32 bits boundaries.

See sse_mathfun.h and sse_mathfun_test.c for compilable code.

Is there a(n easy) way to avoid the penalty I'm seeing?

Upvotes: 1

Answers (2)

Z boson

Reputation: 33679

Maybe you could compare your code to the already working AVX extension of Julien Pommier SSE math functions?

http://software-lisc.fbk.eu/avx_mathfun/

This code works in GCC but not MSVC and only supports floats (float8) but I think you could easily extend it to use doubles (double4) as well. A quick comparison of your sin function shows that they are quite similar except for the SSE2 integer part.

Upvotes: 1

Cory Nelson

Reputation: 30011

Transferring stuff out of registers into memory isn't usually a good idea. You are doing this every time you store into a pointer.

Instead of this:

{ ALIGN32_BEG v4sf *yy ALIGN32_END = (v4sf*) &y;
         emm2[0] = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32( yy[0] ), _v4si_pi32_1), _v4si_pi32_inv1),
         emm2[1] = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32( yy[1] ), _v4si_pi32_1), _v4si_pi32_inv1);
         yy[0] = _mm_cvtepi32_ps(emm2[0]),
         yy[1] = _mm_cvtepi32_ps(emm2[1]);
      }

/* get the swap sign flag */
emm0[0] = _mm_slli_epi32(_mm_and_si128(emm2[0], _v4si_pi32_4), 29),
emm0[1] = _mm_slli_epi32(_mm_and_si128(emm2[1], _v4si_pi32_4), 29);

/* get the polynom selection mask
there is one polynom for 0 <= x <= Pi/4
and another one for Pi/4<x<=Pi/2

Both branches will be computed.
*/
emm2[0] = _mm_cmpeq_epi32(_mm_and_si128(emm2[0], _v4si_pi32_2), _mm_setzero_si128()),
emm2[1] = _mm_cmpeq_epi32(_mm_and_si128(emm2[1], _v4si_pi32_2), _mm_setzero_si128());

((v4sf*)&poly_mask)[0] = _mm_castsi128_ps(emm2[0]);
((v4sf*)&poly_mask)[1] = _mm_castsi128_ps(emm2[1]);
swap_sign_bit = _mm256_castsi256_ps(emmm0);

Try something like this:

__m128i emm2a = _mm_and_si128(_mm_add_epi32( _mm256_castps256_ps128(y), _v4si_pi32_1), _v4si_pi32_inv1);
__m128i emm2b = _mm_and_si128(_mm_add_epi32( _mm256_extractf128_ps(y, 1), _v4si_pi32_1), _v4si_pi32_inv1);

y = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_cvtepi32_ps(emm2a)), _mm_cvtepi32_ps(emm2b), 1);

/* get the swap sign flag */
__m128i emm0a = _mm_slli_epi32(_mm_and_si128(emm2a, _v4si_pi32_4), 29),
__m128i emm0b = _mm_slli_epi32(_mm_and_si128(emm2b, _v4si_pi32_4), 29);

swap_sign_bit = _mm256_castsi256_ps(_mm256_insertf128_si256(_mm256_castsi128_si256(emm0a), emm0b, 1));

/* get the polynom selection mask
there is one polynom for 0 <= x <= Pi/4
and another one for Pi/4<x<=Pi/2

Both branches will be computed.
*/
emm2a = _mm_cmpeq_epi32(_mm_and_si128(emm2a, _v4si_pi32_2), _mm_setzero_si128()),
emm2b = _mm_cmpeq_epi32(_mm_and_si128(emm2b, _v4si_pi32_2), _mm_setzero_si128());

poly_mask = _mm256_castsi256_ps(_mm256_insertf128_si256(_mm256_castsi128_si256(emm2a), emm2b, 1));

As mentioned in comments, cast intrinsics are purely compile-time and emit no instructions.

Upvotes: 1

converting SSE code to AVX - cost of _mm256_and_ps

Answers (2)

Related Questions