ulak blade

Reputation: 2665

SSE4 and SSE2 regarding integer and float performance - which is faster?

While you usually get better integer arithmetic performance than floating-point performance on CPUs, could someone clarify how the SIMD versions compare? For instance:

__m128i _mm_mul_epi32(__m128i a, __m128i b);
//(multiplies the low signed 32-bit integer of each 64-bit element, giving 2 64-bit products)

versus:

__m128 _mm_mul_ps(__m128 a, __m128 b);
//(multiplies 2 float vectors)

Which yields higher performance (assuming the machine has SSE4 capabilities)? I'm asking because I wrote my own little math library based on SSE2 instructions, and I don't know whether I should have gone straight to using __m128i.

Upvotes: 4

Views: 1523

Answers (1)

Z boson

Reputation: 33699

Let me show the first place I go to answer these types of questions: the online Intel Intrinsics Guide. You give it the intrinsic, and it tells you what the intrinsic does and lists the latency and throughput for Nehalem through Haswell (and soon Broadwell) processors. Here are the results:

_mm_mul_ps

                Latency    Reciprocal throughput
Haswell         5          0.5
Ivy Bridge      5          1
Sandy Bridge    5          1
Westmere        4          1
Nehalem         4          1

_mm_mul_epi32

                Latency    Reciprocal throughput
Haswell         5          1
Ivy Bridge      3          1
Sandy Bridge    3          1
Westmere        3          1
Nehalem         3          1

Lower latency and reciprocal throughput are better. From these tables we can conclude that

  • except for Haswell the latency for _mm_mul_epi32 is less than for _mm_mul_ps,
  • on Haswell the latency is the same,
  • except for Haswell the throughput is the same,
  • on Haswell the throughput for _mm_mul_ps is twice that of _mm_mul_epi32.

The throughput on Haswell is the only major surprise.
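To make the two columns concrete, here is a minimal sketch of when each number matters. The function names chain_dependent and chain_independent are made up for illustration. In the first loop every multiply needs the previous result, so the loop is limited by latency. In the second, the four chains do not depend on each other, so the CPU can overlap them and the loop gets closer to the throughput limit. How close it gets depends on the compiler and the processor, so treat this as a starting point for your own timing rather than a benchmark.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_mul_ps */

/* Latency-bound: each _mm_mul_ps must wait for the previous product. */
__m128 chain_dependent(__m128 x, __m128 c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x = _mm_mul_ps(x, c);
    return x;
}

/* Four independent chains: the multiplies in one iteration do not
 * depend on each other, so they can be in flight at the same time. */
void chain_independent(__m128 x[4], __m128 c, size_t n)
{
    __m128 x0 = x[0], x1 = x[1], x2 = x[2], x3 = x[3];
    for (size_t i = 0; i < n; ++i) {
        x0 = _mm_mul_ps(x0, c);
        x1 = _mm_mul_ps(x1, c);
        x2 = _mm_mul_ps(x2, c);
        x3 = _mm_mul_ps(x3, c);
    }
    x[0] = x0; x[1] = x1; x[2] = x2; x[3] = x3;
}
```

If your math library mostly does long dependent chains, compare the latency columns; if it processes many independent vectors, the reciprocal throughput columns are the more relevant ones.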

If you want the results for pre-Nehalem processors and/or for AMD processors, see Agner Fog's instruction tables manual, or run the test programs he used to measure the latency and throughput.
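One caveat when comparing these two intrinsics: they do not do the same amount of work per call. _mm_mul_ps produces four 32-bit float products, while _mm_mul_epi32 reads only the low signed 32-bit integer of each 64-bit element and produces two 64-bit products. The sketch below shows the difference; mul_floats and mul_ints are hypothetical helper names, and the target attribute is a GCC/Clang-specific way to enable SSE4.1 for one function without passing -msse4.1 (that part is an assumption about your toolchain).

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_mul_epi32 (also pulls in SSE/SSE2) */

/* Four float products, one per 32-bit lane. */
void mul_floats(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
}

/* Two 64-bit products: only 32-bit elements 0 and 2 of a and b are
 * used; elements 1 and 3 are ignored.
 * The attribute is GCC/Clang-specific (assumed toolchain). */
__attribute__((target("sse4.1")))
void mul_ints(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epi32(va, vb));
}
```

So even where the tables show equal reciprocal throughput, the float version finishes twice as many multiplies per instruction.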

Upvotes: 4
