Hallgeir

Reputation: 1273

SSE (SIMD): multiply vector by scalar

A common operation in my program is scaling a vector by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there an SSE (or AVX) instruction to do this, other than first loading the scalar into every position of a vector (e.g. _mm_set_ps(2,2,2,2)) and then multiplying?

This is what I do now:

__m128 _scalar = _mm_set_ps(s,s,s,s);
__m128 _result = _mm_mul_ps(_vector, _scalar);

I'm looking for something like...

__m128 _result = _mm_scale_ps(_vector, s);

Upvotes: 27

Views: 20429

Answers (3)

Marat Dukhan

Reputation: 12273

There is no instruction for multiplying a vector by a scalar. There are, however, some instructions for loading the same scalar value into all positions of a vector register.

The AVX instruction set provides the _mm_broadcast_ss/_mm256_broadcast_ss/_mm256_broadcast_sd intrinsics for populating SSE and AVX registers with the same float/double value.

In the SSE3 instruction set you will find the _mm_loaddup_pd intrinsic, which populates both halves of an SSE register with the same double value.

In other versions of SSE, the best option is typically to load the scalar value using _mm_load_ss/_mm_load_sd and then copy it to all elements of a vector register with _mm_shuffle_ps/_mm_unpacklo_pd.
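As a rough sketch of the load-then-copy approach for plain SSE/SSE2, something like the following should work. The helper names (broadcast_ps_sse, broadcast_pd_sse2, scale_ps) are made up for illustration and are not part of any intrinsics header; scale_ps is just a stand-in for the _mm_scale_ps the question wishes for.

```c
#include <xmmintrin.h>  /* SSE:  _mm_load_ss, _mm_shuffle_ps, _mm_mul_ps */
#include <emmintrin.h>  /* SSE2: _mm_load_sd, _mm_unpacklo_pd */

/* Broadcast one float to all four lanes using only SSE (hypothetical helper). */
static __m128 broadcast_ps_sse(const float *p)
{
    __m128 x = _mm_load_ss(p);                             /* [*p, 0, 0, 0] */
    return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 0));  /* lane 0 copied to every lane */
}

/* Broadcast one double to both lanes using only SSE2 (hypothetical helper). */
static __m128d broadcast_pd_sse2(const double *p)
{
    __m128d x = _mm_load_sd(p);    /* [*p, 0] */
    return _mm_unpacklo_pd(x, x);  /* [*p, *p] */
}

/* Hypothetical wrapper: multiply every lane of v by the scalar s. */
static __m128 scale_ps(__m128 v, float s)
{
    return _mm_mul_ps(v, broadcast_ps_sse(&s));
}
```

In practice a compiler will usually emit an equivalent sequence for _mm_set1_ps, so writing the shuffle by hand is mostly useful for seeing what happens underneath.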

Upvotes: 8

Paul R

Reputation: 213120

Depending on your compiler you may be able to improve the code generation a little by using _mm_set1_ps:

const __m128 scalar = _mm_set1_ps(s);
__m128 result = _mm_mul_ps(vector, scalar);

However, scalar constants like this should only need to be initialised once, outside any loops, so the performance cost should be irrelevant (unless the scalar value changes within the loop).
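To illustrate the point about hoisting the constant, here is a minimal sketch that scales an array in place. The function name scale_array is an assumption made for this example, and it uses unaligned loads and stores so it makes no assumptions about the alignment of the data.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Scale n floats in place by s (hypothetical helper, not a library function). */
static void scale_array(float *data, size_t n, float s)
{
    const __m128 scalar = _mm_set1_ps(s);  /* broadcast once, outside the loop */
    size_t i = 0;

    /* Main loop: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, scalar));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        data[i] *= s;
}
```

With the broadcast outside the loop, its cost is paid once per call rather than once per multiply.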

As always you should look at the code your compiler generates and also try running your code under a decent profiler to see where the hotspots really are.

Upvotes: 15

Jason R

Reputation: 11758

I don't know of any single instruction that does what you want. Is the set operation truly a bottleneck? If you're multiplying a large vector by the same constant, the time it takes to fill an XMM/YMM register with four (or, for YMM, eight) copies of the constant should be a very small fraction of the overall time taken.

As a simple optimization, if the constant is 2, as it was in your example, you could replace the multiply with an add instruction instead, which doesn't require any constant.
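For the special case of a factor of 2, the add-instead-of-multiply idea could look roughly like this (double_ps is a made-up name for the example):

```c
#include <xmmintrin.h>  /* SSE: _mm_add_ps */

/* Double every lane without needing a broadcast constant (hypothetical helper). */
static __m128 double_ps(__m128 v)
{
    return _mm_add_ps(v, v);  /* v + v gives the same result as 2 * v */
}
```

Whether this is actually faster than a multiply depends on the CPU, so it is worth measuring before relying on it.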

Upvotes: 3
