Reputation: 1273
A common operation in my program is scaling a vector by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there an SSE (or AVX) instruction to do this, other than first loading the scalar into every position of a vector (e.g. _mm_set_ps(2,2,2,2)) and then multiplying?
This is what I do now:
__m128 _scalar = _mm_set_ps(s,s,s,s);
__m128 _result = _mm_mul_ps(_vector, _scalar);
I'm looking for something like...
__m128 _result = _mm_scale_ps(_vector, s);
Upvotes: 27
Views: 20429
Reputation: 12273
There is no single instruction that multiplies a vector by a scalar. There are, however, instructions that load the same scalar value into every position of a vector register.
The AVX instruction set provides the _mm_broadcast_ss/_mm256_broadcast_ss/_mm256_broadcast_sd intrinsics for populating SSE and AVX registers with the same float/double value.
In the SSE3 instruction set you will find the _mm_loaddup_pd intrinsic, which populates an SSE register with the same double value.
In other versions of SSE, the best option is typically to load the scalar value using _mm_load_ss/_mm_load_sd and then copy it to all elements of a vector register with _mm_shuffle_ps/_mm_unpacklo_pd.
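As a rough illustration of the load-then-copy approach for plain SSE/SSE2, here is a minimal sketch. The helper names broadcast_float, broadcast_double and scale4 are hypothetical (they are not part of any intrinsics header), and unaligned loads and stores are assumed so that the caller does not need to worry about alignment.

```c
#include <xmmintrin.h>  /* SSE:  _mm_load_ss, _mm_shuffle_ps, _mm_mul_ps */
#include <emmintrin.h>  /* SSE2: _mm_load_sd, _mm_unpacklo_pd */

/* Hypothetical helper: copy one float into all four lanes. */
static inline __m128 broadcast_float(const float *p)
{
    __m128 lo = _mm_load_ss(p);  /* lanes: [*p, 0, 0, 0] */
    /* Select lane 0 for every output lane. */
    return _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(0, 0, 0, 0));
}

/* Hypothetical helper: copy one double into both lanes. */
static inline __m128d broadcast_double(const double *p)
{
    __m128d lo = _mm_load_sd(p);     /* lanes: [*p, 0] */
    return _mm_unpacklo_pd(lo, lo);  /* lanes: [*p, *p] */
}

/* Hypothetical helper: out[0..3] = in[0..3] * s. */
static void scale4(const float *in, float s, float *out)
{
    __m128 v = _mm_loadu_ps(in);     /* unaligned load of four floats */
    __m128 k = broadcast_float(&s);  /* [s, s, s, s] */
    _mm_storeu_ps(out, _mm_mul_ps(v, k));
}
```

Calling scale4 on {1, 2, 3, 4} with s = 2 should give {2, 4, 6, 8}. In practice a compiler may well emit the same load-and-shuffle sequence for _mm_set1_ps, so it is worth checking the generated code before writing this by hand.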
Upvotes: 8
Reputation: 213120
Depending on your compiler, you may be able to improve the code generation a little by using _mm_set1_ps:
const __m128 scalar = _mm_set1_ps(s);
__m128 result = _mm_mul_ps(vector, scalar);
However, scalar constants like this should only need to be initialised once, outside any loops, so the performance cost should be irrelevant (unless the scalar value changes within the loop).
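To make the "initialise once, outside the loop" point concrete, here is a minimal sketch that scales a whole array in place. The function name scale_array is hypothetical, and the sketch assumes unaligned data and handles any leftover elements with a scalar tail loop.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper: data[i] *= s for every i in [0, n). */
static void scale_array(float *data, size_t n, float s)
{
    /* Broadcast the scalar once, before the loop. */
    const __m128 scalar = _mm_set1_ps(s);
    size_t i = 0;

    /* Main loop: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, scalar));
    }

    /* Tail: the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        data[i] *= s;
    }
}
```

With this shape the broadcast happens once per call rather than once per iteration, so its cost is amortised over the whole array.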
As always you should look at the code your compiler generates and also try running your code under a decent profiler to see where the hotspots really are.
Upvotes: 15
Reputation: 11758
I don't know of any single instruction that does what you want. Is the set operation truly a bottleneck? If you're multiplying a large vector by the same constant, the time it takes to fill an XMM/YMM register with four copies of the constant should be a very small fraction of the overall time taken.
As a simple optimization, if the constant is 2, as it was in your example, you could replace the multiply with an add instruction (adding the vector to itself), which does not require any constant.
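A minimal sketch of the add-instead-of-multiply idea for the special case s == 2 follows. The names double_ps and double4 are hypothetical, and unaligned loads and stores are assumed.

```c
#include <xmmintrin.h>  /* SSE: _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical helper: multiply every lane by 2 using v + v,
   so no broadcast constant is needed. */
static inline __m128 double_ps(__m128 v)
{
    return _mm_add_ps(v, v);
}

/* Hypothetical helper: out[0..3] = in[0..3] * 2. */
static void double4(const float *in, float *out)
{
    _mm_storeu_ps(out, double_ps(_mm_loadu_ps(in)));
}
```

Whether this is actually faster than a multiply depends on the target CPU, so it is worth measuring before relying on it.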
Upvotes: 3