Reputation: 180303
I have std::vector<double> X,Y
both of size N
(with N%16==0
) and I want to calculate sum(X[i]*Y[i])
. That's a classical use case for Fused Multiply and Add (FMA), which should be fast on AVX-capable processors. I know all my target CPU's are Intel, Haswell or newer.
How do I get GCC to emit that AVX code? -mfma
is part of the solution, but do I need other switches?
And is std::vector<double>::operator[]
hindering this? I know I can transform
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i != N; ++i) sum += X[i] * Y[i];
to
size_t N = X.size();
double sum = 0.0;
double const* Xp = &X[0];
double const* Yp = &X[0];
for (size_t i = 0; i != N; ++i) sum += Xp[i] * Yp[i];
so the compiler can spot that &X[0]
doesn't change in the loop. But is this sufficient or even necessary?
Current compiler is GCC 4.9.2, Debian 8, but could upgrade to GCC 5 if necessary.
Upvotes: 3
Views: 530
Reputation: 33709
Did you look at the assembly? I put
double foo(std::vector<double> &X, std::vector<double> &Y) {
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i <N; ++i) sum += X[i] * Y[i];
return sum;
}
into http://gcc.godbolt.org/ and looked at the assembly in GCC 4.9.2 with -O3 -mfma
and I see
.L3:
vmovsd (%rcx,%rax,8), %xmm1
vfmadd231sd (%rsi,%rax,8), %xmm1, %xmm0
addq $1, %rax
cmpq %rdx, %rax
jne .L3
So it uses fma. However, it doest not vectorize the loop (The s
in sd
means single (i.e. not packed) and the d
means double floating point).
To vectorize the loop you need to enable associative math e.g. with -Ofast
. Using -Ofast -mavx2 -mfma
gives
.L8:
vmovupd (%rax,%rsi), %xmm2
addq $1, %r10
vinsertf128 $0x1, 16(%rax,%rsi), %ymm2, %ymm2
vfmadd231pd (%r12,%rsi), %ymm2, %ymm1
addq $32, %rsi
cmpq %r10, %rdi
ja .L8
So now it's vectorized (pd
means packed doubles). However, it's not unrolled. This is currently a limitation of GCC. You need to unroll several times due to the dependency chain. If you want to have the compiler do this for you then consider using Clang which unrolls four times otherwise unroll by hand with intrinsics.
Note that unlike GCC, Clang does not use fma by default with -mfma
. In order to use fma with Clang use -ffp-contract=fast
(e.g. -O3 -mfma -ffp-contract=fast
) or #pragma STDC FP_CONTRACT ON
or enable associative math with e.g. -Ofast
You're going to want to enable associate math anyway if you want to vectorize the loop with Clang.
See Fused multiply add and default rounding modes and https://stackoverflow.com/a/34461738/2542702 for more info about enabling fma with different compilers.
GCC creates a lot of extra code to handle misalignment and for N
not a multiples of 8. You can tell the compiler to assume the arrays are aligned using __builtin_assume_aligned
and that N is a multiple of 8 using N & -8
The following code with -Ofast -mavx2 -mfma
double foo2(double * __restrict X, double * __restrict Y, int N) {
X = (double*)__builtin_assume_aligned(X,32);
Y = (double*)__builtin_assume_aligned(Y,32);
double sum = 0.0;
for (int i = 0; i < (N &-8); ++i) sum += X[i] * Y[i];
return sum;
}
produces the following simple assembly
andl $-8, %edx
jle .L4
subl $4, %edx
vxorpd %xmm0, %xmm0, %xmm0
shrl $2, %edx
xorl %ecx, %ecx
leal 1(%rdx), %eax
xorl %edx, %edx
.L3:
vmovapd (%rsi,%rdx), %ymm2
addl $1, %ecx
vfmadd231pd (%rdi,%rdx), %ymm2, %ymm0
addq $32, %rdx
cmpl %eax, %ecx
jb .L3
vhaddpd %ymm0, %ymm0, %ymm0
vperm2f128 $1, %ymm0, %ymm0, %ymm1
vaddpd %ymm1, %ymm0, %ymm0
vzeroupper
ret
.L4:
vxorpd %xmm0, %xmm0, %xmm0
ret
Upvotes: 3
Reputation: 1439
I'm not sure this will get you all the way there, but I'm almost sure that a big part of the solution.
You have to break the loop into two: 0 to N, with step M>1. I'd try with M of 16, 8, 4, and look at the asm. And a inner loop of 0 to M. Don't worry about the math iterator math. Gcc is smart enough with it.
Gcc should unroll the inner loop and them it can SIMD it and maybe use FMA.
Upvotes: 0