Reputation: 21343
I want to write fast simd code to compute the multiplicative reduction of a complex array. In standard C this is:
#include <complex.h>
complex float f(complex float x[], int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}
n
will be at most 50.
Gcc can't auto-vectorize complex multiplication but, as I am happy to assume the gcc compiler and if I knew I wanted to target sse3 I could follow How to enable sse3 autovectorization in gcc and write:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x[], int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}
This indeed produces fast vectorized assembly code using gcc. Although you still need to pad your input to a multiple of 4. The assembly you get is:
.L3:
vmovaps xmm0, XMMWORD PTR 16[rsi]
add rsi, 32
vmulps xmm1, xmm0, xmm2
vmulps xmm0, xmm0, xmm3
vfmsubps xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
vmovaps xmm3, xmm1
vfmaddps xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
cmp rdx, rsi
jne .L3
However, it is designed for the exact simd instruction set and is not optimal for avx2 or avx512 for example for which you need to change the code.
How can you write C or C++ code for which gcc will produce optimal code when compiled for any of sse, avx2 or avx512? That is, do you always have to write separate functions by hand for each different width of SIMD register?
Are there any open source libraries that make this easier?
Upvotes: 30
Views: 4548
Reputation: 18827
Here would be an example using the Eigen library:
#include <Eigen/Core>
std::complex<float> f(const std::complex<float> *x, int n)
{
return Eigen::VectorXcf::Map(x, n).prod();
}
If you compile this with clang or g++ and sse or avx enabled (and -O2), you should get fairly decent machine code. It also works for some other architectures like Altivec or NEON. If you know that the first entry of x
is aligned, you can use MapAligned
instead of Map
.
You get even better code, if you happen to know the size of your vector at compile time using this:
template<int n>
std::complex<float> f(const std::complex<float> *x)
{
return Eigen::Matrix<std::complex<float>, n, 1> >::MapAligned(x).prod();
}
Note: The functions above directly correspond to the function f
of the OP.
However, as @PeterCordes pointed out, it is generally bad to store complex numbers interleaved, since this will require lots of shuffling for multiplication. Instead, one should store real and imaginary parts in a way that they can be directly loaded one packet at once.
Edit/Addendum: To implement a structure-of-arrays like complex multiplication, you can actually write something like:
typedef Eigen::Array<float, 8, 1> v8sf; // Eigen::Array allows element-wise standard operations
typedef std::complex<v8sf> complex8;
complex8 prod(const complex8& a, const complex8& b)
{
return a*b;
}
Or more generic (using C++11):
template<int size, typename Scalar = float> using complexX = std::complex<Eigen::Array<Scalar, size, 1> >;
template<int size>
complexX<size> prod(const complexX<size>& a, const complexX<size>& b)
{
return a*b;
}
When compiled with -mavx -O2
, this compiles to something like this (using g++-5.4):
vmovaps 32(%rsi), %ymm1
movq %rdi, %rax
vmovaps (%rsi), %ymm0
vmovaps 32(%rdi), %ymm3
vmovaps (%rdi), %ymm4
vmulps %ymm0, %ymm3, %ymm2
vmulps %ymm4, %ymm1, %ymm5
vmulps %ymm4, %ymm0, %ymm0
vmulps %ymm3, %ymm1, %ymm1
vaddps %ymm5, %ymm2, %ymm2
vsubps %ymm1, %ymm0, %ymm0
vmovaps %ymm2, 32(%rdi)
vmovaps %ymm0, (%rdi)
vzeroupper
ret
For reasons not obvious to me, this is actually hidden in a method which is called by the actual method, which just moves around some memory -- I don't know why Eigen/gcc does not assume that the arguments are already properly aligned. If I compile the same with clang 3.8.0 (and the same arguments), it is compiled to just:
vmovaps (%rsi), %ymm0
vmovaps %ymm0, (%rdi)
vmovaps 32(%rsi), %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps (%rdi), %ymm1
vmovaps (%rdx), %ymm2
vmovaps 32(%rdx), %ymm3
vmulps %ymm2, %ymm1, %ymm4
vmulps %ymm3, %ymm0, %ymm5
vsubps %ymm5, %ymm4, %ymm4
vmulps %ymm3, %ymm1, %ymm1
vmulps %ymm0, %ymm2, %ymm0
vaddps %ymm1, %ymm0, %ymm0
vmovaps %ymm0, 32(%rdi)
vmovaps %ymm4, (%rdi)
movq %rdi, %rax
vzeroupper
retq
Again, the memory-movement at the beginning is weird, but at least that is vectorized. For both gcc and clang this get optimized away when called in a loop, however:
complex8 f8(complex8 x[], int n) {
if(n==0)
return complex8(v8sf::Ones(),v8sf::Zero()); // I guess you want p = 1 + 0*i at the beginning?
complex8 p = x[0];
for (int i = 1; i < n; i++) p = prod(p, x[i]);
return p;
}
The difference here is that clang will unroll that outer loop to 2 multiplications per loop. On the other hand, gcc will use fused-multiply-add instructions when compiled with -mfma
.
The f8
function can of course also be generalized to arbitrary dimensions:
template<int size>
complexX<size> fX(complexX<size> x[], int n) {
using S= typename complexX<size>::value_type;
if(n==0)
return complexX<size>(S::Ones(),S::Zero());
complexX<size> p = x[0];
for (int i = 1; i < n; i++) p *=x[i];
return p;
}
And for reducing the complexX<N>
to a single std::complex
the following function can be used:
// only works for powers of two
template<int size> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<size>& var) {
complexX<size/2> a(var.real().template head<size/2>(), var.imag().template head<size/2>());
complexX<size/2> b(var.real().template tail<size/2>(), var.imag().template tail<size/2>());
return redux(a*b);
}
template<> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<1>& var) {
return std::complex<float>(var.real()[0], var.imag()[0]);
}
However, depending on whether I use clang or g++, I get quite different assembler output. Overall, g++ has a tendency to fail to inline loading the input arguments, and clang fails to use FMA operations (YMMV ...) Essentially, you need to inspect the generated assembler code anyway. And more importantly, you should benchmark the code (not sure, how much impact this routine has in your overall problem).
Also, I wanted to note that Eigen actually is a linear algebra library. Exploiting it for pure portable SIMD code generation is not really what is designed for.
Upvotes: 14
Reputation: 2536
If Portability is your main concern, there are many libraries here which provide SIMD instructions in their own syntax. Most of them do the explicit vectorization more simple and portable than intrinsics. This Library (UME::SIMD) is recently published and has a great performance
In this paper(UME::SIMD) an interface based on Vc has been established which is named UME::SIMD. It allows the programmer to access the SIMD capabilities without the need for extensive knowledge of SIMD ISAs. UME::SIMD provides a simple, flexible and portable abstraction for explicit vectorization without performance losses compared to intrinsics
Upvotes: 3
Reputation: 28300
I don't think you have a fully general solution for this. You can increase your "vector_size" to 32:
typedef float v4sf __attribute__ ((vector_size (32)));
Also increase all arrays to have 8 elements:
typedef float v8sf __attribute__ ((vector_size (32)));
typedef union {
v8sf v;
float e[8];
} float8;
typedef struct {
float8 x;
float8 y;
} complex8;
static complex8 complex8_mul(complex8 a, complex8 b) {
return (complex8){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
This will make the compiler able to generate AVX512 code (don't forget to add -mavx512f
), but will make your code slightly worse in SSE by making memory transfers sub-optimal. However, it will certainly not disable SSE vectorization.
You could keep both versions (with 4 and with 8 array elements), switching between them by some flag, but it might be too tedious for little benefit.
Upvotes: 1