Reputation: 847
I have the following code (minimal example):
#include <cstdlib>      // rand
#include <iostream>
#include <immintrin.h>
using namespace std;
int main(){
    // Two vectors of four 32-bit ints each
    __m128i a = _mm_set_epi32(rand(),rand(),rand(),rand());
    __m128i b = _mm_set_epi32(rand(),rand(),rand(),rand());
    __m128i c = _mm_add_epi32(a,b);
    // Store the result to an unaligned int array, then print each element
    int d[4];
    _mm_storeu_si128((__m128i*)d,c);
    cout<<d[0]<<endl;
    cout<<d[1]<<endl;
    cout<<d[2]<<endl;
    cout<<d[3]<<endl;
    return 0;
}
When compiled with g++ -O3 -march=native, it produces some strange, inefficient-looking assembly (https://godbolt.org/z/TQgbim). It stores c once, and then for each element access it does an aligned load followed by an extract. I can see why it needs to store c to memory, and I can see how an aligned load plus an extract might be efficient, but I don't see why it keeps reloading the same data into the xmm register after each extraction. Also, when d is allocated on the heap instead (https://godbolt.org/z/Pk7qP2), it doesn't do the aligned loads at all; it just treats d as a normal array and accesses the elements that way. Could someone please explain why it's doing this and what possible benefit it could bring? Thanks.
Upvotes: 1
Views: 228
Reputation: 366066
Yup, that's an amusing missed-optimization.
Looks like it decided to optimize vector store / scalar reload into vector extract, which is normally good.
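But it did so without taking into account the calling convention: x86-64 System V (used on Linux) has no call-preserved vector registers. Every operator<< call may clobber all of the xmm registers, so the vector has to be reloaded from the stack before each extract. This code would have been fine on Windows x64, where it could use xmm6, for example.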
This code would also have been fine if you called functions that inlined, or if you passed all 4 elements as args to the same function (e.g. printf).
GCC has multiple passes, and the architecture-neutral middle passes operating on a GIMPLE representation of the program logic sometimes can't take advantage of full details that aren't known until register-allocation time. Some optimizations are hard for gcc because it just isn't wired to be able to see them.
BTW, if you care about that level of efficiency, use '\n' instead of endl. You don't need to explicitly flush cout there.
Upvotes: 4