Reputation: 349
Is there a way to evaluate a function along a __m256d/s
vector? Like this:
#include <immintrin.h>
inline __m256d func(__m256d *a, __m256d *b)
{
return 1 / ((*a + *b) * (*a + *b));
}
int main()
{
__m256d a = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
__m256d b = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
__m256d c = func(&a, &b);
return 0;
}
I would like to evaluate any given mathematical function using the SIMD paradigm. If this isn't possible, wouldn't that be the biggest limitation of SIMD programming vs. GPGPU? I mean, I've realized that the compute power of CPUs in terms of FLOPS is getting closer to that of GPUs; some comparisons:
Future guesses:
AVX-512 and a probable 20-core Xeon CPU: 3840 GFLOPS (20 cores * 64 FLOP/cycle * 3 GHz)
Knights Landing: 5907 GFLOPS (71 cores * 64 FLOP/cycle * 1.3 GHz)
Upvotes: 3
Views: 566
Reputation: 5570
Your question is very interesting. What you are describing cannot be done with existing compilers directly on the raw vector types. However, if you overload the basic arithmetic operators for a type wrapping the 256-bit vectors, you can get close to the functionality you want.
However, I would not say that this is the biggest limitation of SIMD programming vs. GPGPU. The main advantage of GPGPUs is the FLOPS count, but that comes at a cost: GPGPUs don't handle branches very well, nor threads that work on large amounts of local data. Another limitation is that the GPGPU programming model is considerably more complex than traditional coding.
On a CPU you can run more general code, and the compiler will vectorize it most of the time, without asking the programmer to write specific intrinsics.
So I'd go further and say that simpler code is actually an advantage for CPUs. Consider the effort needed to port 20-year-old FORTRAN software to a GPGPU, whereas with a good compiler and a good CPU (with a good FLOP count), you may get the expected performance with little rewriting.
Upvotes: 2