Reputation: 22149
I have a shader I need to optimise (with lots of vector operations) and I am experimenting with SSE instructions in order to better understand the problem.
I have some very simple sample code. With the USE_SSE define it uses explicit SSE intrinsics; without it I'm hoping GCC will do the work for me. Auto-vectorisation feels a bit finicky but I'm hoping it will save me some hair.
The compiler and platform are: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32, on Windows 7 with an Ivy Bridge CPU.
Here's the test code:
/*
 Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif
#include <cstdio>

#if defined(__GNUG__) || defined(__clang__)
/* GCC & CLANG */
#define SSVEC_FINLINE __attribute__((always_inline))
#elif defined(_WIN32) && defined(_MSC_VER)
/* MSVC. */
#define SSVEC_FINLINE __forceinline
#else
#error Unsupported platform.
#endif

#ifdef USE_SSE
typedef __m128 vec4f;
inline void addvec4f(vec4f &a, vec4f const &b)
{
    a = _mm_add_ps(a, b);
}
#else
typedef float vec4f[4];
inline void addvec4f(vec4f &a, vec4f const &b)
{
    a[0] = a[0] + b[0];
    a[1] = a[1] + b[1];
    a[2] = a[2] + b[2];
    a[3] = a[3] + b[3];
}
#endif

int main(int argc, char *argv[])
{
    int const count = 1e7;
#ifdef USE_SSE
    printf("Using SSE.\n");
#else
    printf("Not using SSE.\n");
#endif
    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};
    for (int i = 0; i < count; ++i)
    {
        vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
        addvec4f(data, val);
    }
    float result[4] = {0};
#ifdef USE_SSE
    /* Unaligned store: result is not guaranteed to be 16-byte aligned. */
    _mm_storeu_ps(result, data);
#else
    result[0] = data[0];
    result[1] = data[1];
    result[2] = data[2];
    result[3] = data[3];
#endif
    printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);
    return 0;
}
This is compiled with:
g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe
Apart from the explicit SSE version being a bit quicker, there is no difference in output.
Here's the assembly for the loop, first explicit SSE:
.L3:
subl $1, %eax
addps %xmm1, %xmm0
jne .L3
It inlined the call. Nice, more or less just a straight-up _mm_add_ps.
Array version:
.L3:
subl $1, %eax
addss %xmm0, %xmm1
addss %xmm0, %xmm2
addss %xmm0, %xmm3
addss %xmm0, %xmm4
jne .L3
It is using SSE math all right, but one array member at a time (four scalar addss instead of a single packed addps). Not really desirable.
My question is, how can I help GCC so that it can better optimise the array version of vec4f?
Any Linux-specific tips are helpful too; that's where the real code will run.
Upvotes: 5
Views: 3422
Reputation: 2219
Here are some tips, based on your code, for getting GCC auto-vectorization to work:
Remove the inline keyword. If the function is marked inline, GCC cannot know whether the start of the array is aligned without inter-procedural analysis, which is not turned on by -O3.
So, to get your code vectorized, your addvec4f function should be modified as follows:
void addvec4f(vec4f &a, vec4f const &b)
{
    int i = 0;
    for (; i < 4; i++)
        a[i] = a[i] + b[i];
}
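A minimal, self-contained sketch of the loop form above, for experimenting with the vectorizer. The accumulate helper is a hypothetical name introduced here (it is not in the question), and whether GCC actually emits a packed addps depends on the compiler version and flags, so the generated assembly still needs to be checked.

```cpp
#include <cstdio>

typedef float vec4f[4];

// Loop form of the add: a fixed trip count of 4 and no `inline` keyword,
// as suggested in the answer.
void addvec4f(vec4f &a, vec4f const &b)
{
    for (int i = 0; i < 4; ++i)
        a[i] = a[i] + b[i];
}

// Hypothetical helper (not from the question): adds `step` to `data`
// `count` times, mirroring the loop in the question's main().
void accumulate(vec4f &data, vec4f const &step, int count)
{
    for (int i = 0; i < count; ++i)
        addvec4f(data, step);
}

// Prints the four components in the same format as the question's test code.
void print_vec4f(vec4f const &v)
{
    std::printf("Result: %f %f %f %f\n", v[0], v[1], v[2], v[3]);
}
```

For example, starting from data = {1, 1, 1, 1} and step = {0.5, 0.5, 0.5, 0.5}, accumulate(data, step, 4) leaves every component at 3.0. Compiling with g++ -O3 -S and reading the loop body is one way to see whether the four scalar adds were merged.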
BTW: -ftree-vectorizer-verbose=2 makes GCC report what the vectorizer did. A higher number gives more output; currently the value can be 0, 1 or 2. Here is the documentation of this flag, and some other related flags.
Be careful: you can get a bus error if the data is not aligned. Here is the reason.
Upvotes: 5
Reputation: 158469
This LockLess article on Auto-vectorization with gcc 4.7 is hands down the best article I have seen on the topic, and I have spent a while looking for good articles on similar topics. They also have a lot of other articles that you may find very useful, dealing with all manner of low-level software development.
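Independently of the linked article, here is a hedged sketch of two GCC/Clang extensions that are often used to give the vectorizer the alignment and aliasing facts it cannot prove by itself: __restrict and __builtin_assume_aligned. The function name add_arrays is hypothetical, and the 16-byte alignment is an assumption the caller must guarantee; this is not presented as the article's own code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adds b[i] to a[i] for n floats.
// Assumptions (must be guaranteed by the caller): a and b are 16-byte
// aligned and do not overlap. Violating either is undefined behaviour.
void add_arrays(float *__restrict a, float const *__restrict b, std::size_t n)
{
    // Debug-build check of the alignment assumption.
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    // GCC/Clang builtin: returns the same pointer and tells the optimizer
    // it may assume the given alignment.
    float *pa = static_cast<float *>(__builtin_assume_aligned(a, 16));
    float const *pb =
        static_cast<float const *>(__builtin_assume_aligned(b, 16));

    for (std::size_t i = 0; i < n; ++i)
        pa[i] += pb[i];
}
```

Callers can satisfy the alignment assumption with, for example, alignas(16) float a[8]; in C++11 or later. These extensions are not available in MSVC under these names, so code that must also build there would need a different spelling.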
Upvotes: 8