Reputation: 349
Is there a way to evaluate a function along a __m256d/s
vector? Like this:
#include <immintrin.h>
inline __m256d func(__m256d *a, __m256d *b)
{
return 1 / ((*a + *b) * (*a + *b));
}
int main()
{
__m256d a = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
__m256d b = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
__m256d c = func(&a, &b);
return 0;
}
I would like to evaluate any given mathematical function using the SIMD paradigm. If this isn't possible, wouldn't that be the biggest limitation of SIMD programming vs. GPGPU? I mean, I've realized that the compute power of CPUs in terms of FLOPS is getting closer to that of GPUs; some comparisons:
Future guesses:
AVX-512 and a probable 20-core Xeon CPU: 3840 GFLOPS (20 cores * 64 FLOP/cycle * 3 GHz)
Knights Landing: 5907 GFLOPS (71 cores * 64 FLOP/cycle * 1.3 GHz)
Upvotes: 3
Views: 566
Reputation: 5570
Your question is very interesting. What you are describing cannot be done with existing compilers directly on the raw vector types. However, if you overload the basic arithmetic operators for a type wrapping the 256-bit vectors, you can get close to the functionality you want.
However, I would not say that this is the biggest limitation of SIMD programming vs. GPGPU. The main advantage of GPGPUs is the FLOPS count, but that comes at a cost: GPGPUs don't handle branches very well, nor threads that work on large amounts of local data. Another limitation is that the GPGPU programming model is considerably more complex than traditional coding.
On a CPU you can run more general code, and the compiler will vectorize it most of the time, without asking the programmer to write specific intrinsics.
So I'd go further and say that simpler code is actually an advantage for CPUs. Consider the effort needed to port 20-year-old FORTRAN software to a GPGPU, whereas with a good compiler and a good CPU (with a good FLOP count), you may get the expected performance with little rewriting.
Upvotes: 2