Reputation: 181
I have a heavy number-crunching program that does image processing. It is mostly convolutions. It is written in C++ and compiled with MinGW GCC 4.8.1. I run it on a laptop with an Intel Core i7 4900MQ (which supports SSE up to SSE4.2 and AVX2).
When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2), I see no speedup compared to using the default x87 FPU.
When I use doubles instead of floats, there is no slowdown.
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
Upvotes: 1
Views: 481
Reputation: 2749
There is an outside chance that this is related to a problem I have seen where system libraries distributed with the MinGW GCC port use x87 80-bit FP internally.
Based on a couple of random test pieces not intended to be vectoriser-friendly, I reckon the improvement from pure x87 code to SSE2 or higher at double precision should be around 25% on arithmetic-heavy code that isn't memory-bandwidth limited, and, all other things being equal, another 40+% for float vs. double.
    double BM_Pi(double x)
    {
        int n = 20;
        double f = 1;
        while (--n)
            f = (2 * n - 1) + n * n / f; // the Phi test has f = 1 + 1 / (f + 1) here
        return 4 / f;
    }
Forcing x87 code generation by pretending to be an i586, and otherwise allowing all optimisations with -O3, these are the results for the different compilers I have to hand:
Compiler | Phi | Pi |
---|---|---|
Intel AVX2 | 30 | 77 |
MS AVX2 | 63 | 102 |
MS x87 | 84 | 115 |
GCC AVX2 | 61 | 95 |
GCC x87 | 83 | 116 |
What I see on some memory-intensive (not vector-friendly) code is that SSE2 FP is sometimes about 20% faster than AVX and 10% faster than AVX2. It is obviously very code dependent. The Intel compiler has probably found a way to vectorise the Phi test (an exact factor-of-two speedup is suspicious).
Upvotes: 0
Reputation: 3911
There is no code and no description of the test procedure, but it can generally be explained this way:
It's not only about being CPU-bound; you can also be bounded by memory speed. Image processing usually has a large working set that exceeds the cache of your non-Xeon CPU, so eventually the CPU starves and the overall throughput is bounded by memory speed.
You may also be using an algorithm that is not friendly to vectorization. Not every algorithm benefits from being vectorized; many conditions have to be met - flow dependencies, memory layout, etc.
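As an illustration of the flow-dependency condition (the function names here are mine, not from the question): a loop whose iterations are independent vectorizes readily, while a loop-carried dependency forces the compiler to process elements one at a time.

```cpp
#include <cstddef>

// Independent iterations: each out[i] depends only on the inputs, so
// the compiler can process several elements per SIMD instruction.
void scale_add(const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 2.0f * a[i] + b[i];
}

// Loop-carried flow dependency: each result needs the previous one
// (out[i] depends on out[i-1] via acc), so straightforward
// auto-vectorization of this loop is not possible.
float running_sum(const float* a, float* out, std::size_t n)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i];
        out[i] = acc;
    }
    return acc;
}
```

Compiling both with -O3 -ftree-vectorizer-verbose=2 shows the first loop being vectorized and the second one rejected.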
Upvotes: 1
Reputation: 12068
> My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
Yes, you are.
The compiler is only as good as your code - remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It is not as easy as "turn the switch on and enjoy a 100% performance boost".
First of all, compile your code with -ftree-vectorizer-verbose=N to see what really was vectorized by the compiler. N is the verbosity level; make it 5 to see all available output (more info can be found here).
Also, you may want to read about GCC's vectorizer.
And keep in mind that for performance-critical sections of code, using SSE/AVX intrinsics (brilliantly documented here) directly may be the best option.
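For example, here is a minimal intrinsics sketch of my own (assuming an x86 target; _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps are the actual SSE intrinsic names) that adds two float arrays four lanes at a time:

```cpp
#include <immintrin.h>  // x86 SIMD intrinsics (SSE and later)
#include <cstddef>

// Adds two float arrays four elements at a time using 128-bit SSE
// registers; a scalar tail handles lengths not divisible by 4.
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned 4-float load
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)                     // scalar tail
        out[i] = a[i] + b[i];
}
```

Written this way, four single-precision additions happen per instruction, which is exactly where the 2x float-vs-double advantage comes from: a 128-bit register holds four floats but only two doubles.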
Upvotes: 5