Reputation: 2665
While you usually get better integer arithmetic performance than floating-point performance on CPUs, could someone clarify how things stand with the SIMD versions? For instance:
__m128i _mm_mul_epi32(__m128i a, __m128i b);
//(multiplies 2 integer vectors)
versus:
__m128 _mm_mul_ps(__m128 a, __m128 b);
//(multiplies 2 float vectors)
Which yields higher performance? (Assume the machine has SSE4 capabilities.) I ask because I coded my own little math library based on SSE2 instructions, and I don't know if I should have gone right ahead with using __m128i.
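For reference, here is a minimal sketch (not part of my library, just an illustration I put together) showing the two intrinsics side by side. Note that _mm_mul_epi32 only multiplies the low 32-bit integer of each 64-bit lane, producing two 64-bit products, while _mm_mul_ps multiplies all four floats:

#include <smmintrin.h> /* SSE4.1 header, provides _mm_mul_epi32 */
#include <stdio.h>

int main(void)
{
    /* integer inputs: four 32-bit lanes each */
    __m128i ia = _mm_setr_epi32(1, 2, 3, 4);
    __m128i ib = _mm_setr_epi32(10, 20, 30, 40);
    /* _mm_mul_epi32 uses only lanes 0 and 2 and yields two 64-bit products */
    __m128i iprod = _mm_mul_epi32(ia, ib);   /* {1*10, 3*30} */

    /* float inputs: four 32-bit lanes each */
    __m128 fa = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 fb = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
    __m128 fprod = _mm_mul_ps(fa, fb);       /* {10, 40, 90, 160} */

    long long i64[2];
    float f32[4];
    _mm_storeu_si128((__m128i *)i64, iprod);
    _mm_storeu_ps(f32, fprod);
    printf("epi32: %lld %lld\n", i64[0], i64[1]);
    printf("ps:    %g %g %g %g\n", f32[0], f32[1], f32[2], f32[3]);
    return 0;
}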
Upvotes: 4
Views: 1523
Reputation: 33699
Let me show the first place I go to answer these types of questions: the Intel Intrinsics Guide online. You enter an intrinsic and it tells you what it does and gives the latency and throughput for Nehalem through Haswell (and soon Broadwell) processors. Here are the results:
_mm_mul_ps
              Latency   Reciprocal throughput
Haswell       5         0.5
Ivy Bridge    5         1
Sandy Bridge  5         1
Westmere      4         1
Nehalem       4         1
_mm_mul_epi32
              Latency   Reciprocal throughput
Haswell       5         1
Ivy Bridge    3         1
Sandy Bridge  3         1
Westmere      3         1
Nehalem       3         1
Lower latency and lower reciprocal throughput are better. From these tables we can conclude that the latency of _mm_mul_epi32 is less than that of _mm_mul_ps (except on Haswell, where both are 5 cycles), and that on Haswell the throughput of _mm_mul_ps is twice that of _mm_mul_epi32: a reciprocal throughput of 0.5 means two of those multiplies can issue per cycle, versus one for _mm_mul_epi32. The throughput on Haswell is the only major surprise.
If you want results for pre-Nehalem processors and/or for AMD processors, see Agner Fog's instruction tables manual, or run the test programs he used to measure the latency and throughput.
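If you want to experiment yourself, here is a rough sketch of the basic idea behind a latency measurement (my own illustration, not Agner's actual test program; it assumes GCC or Clang on x86-64 and uses __rdtsc as the timer): feed each multiply's result into the next one so the multiplies cannot overlap, then divide the elapsed cycles by the iteration count. A throughput test would instead run several independent chains in parallel. The same loop structure works for _mm_mul_epi32 with __m128i variables.

#include <smmintrin.h> /* pulls in SSE through SSE4.1 */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h> /* __rdtsc (GCC/Clang) */

#define ITERS 100000000ULL

int main(void)
{
    __m128 f = _mm_set1_ps(1.0f);
    __m128 c = _mm_set1_ps(0.9999999f); /* keeps the value finite and avoids denormals */

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < ITERS; ++i)
        f = _mm_mul_ps(f, c); /* dependent chain: each multiply waits for the previous one */
    uint64_t t1 = __rdtsc();

    float out[4];
    _mm_storeu_ps(out, f); /* keep the result live so the loop isn't optimized away */
    /* __rdtsc counts reference cycles, so compare runs at a fixed clock speed */
    printf("~%.2f reference cycles per mulps (latency-bound), result %g\n",
           (double)(t1 - t0) / ITERS, out[0]);
    return 0;
}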
Upvotes: 4