MSalters
MSalters

Reputation: 180303

What do I need to do so GCC 4.9 recognizes the opportunity to use AVX FMA?

I have std::vector<double> X,Y both of size N (with N%16==0) and I want to calculate sum(X[i]*Y[i]). That's a classical use case for Fused Multiply and Add (FMA), which should be fast on AVX-capable processors. I know all my target CPU's are Intel, Haswell or newer.

How do I get GCC to emit that AVX code? -mfma is part of the solution, but do I need other switches?

And is std::vector<double>::operator[] hindering this? I know I can transform

size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i != N; ++i) sum += X[i] * Y[i];

to

size_t N = X.size();
double sum = 0.0;
double const* Xp = &X[0];
double const* Yp = &X[0];
for (size_t i = 0; i != N; ++i) sum += Xp[i] * Yp[i];

so the compiler can spot that &X[0] doesn't change in the loop. But is this sufficient or even necessary?

Current compiler is GCC 4.9.2, Debian 8, but could upgrade to GCC 5 if necessary.

Upvotes: 3

Views: 530

Answers (2)

Z boson
Z boson

Reputation: 33709

Did you look at the assembly? I put

double foo(std::vector<double> &X, std::vector<double> &Y) {
    size_t N = X.size();
    double sum = 0.0;
    for (size_t i = 0; i <N; ++i) sum += X[i] * Y[i];
    return sum;
}

into http://gcc.godbolt.org/ and looked at the assembly in GCC 4.9.2 with -O3 -mfma and I see

.L3:
        vmovsd  (%rcx,%rax,8), %xmm1
        vfmadd231sd     (%rsi,%rax,8), %xmm1, %xmm0
        addq    $1, %rax
        cmpq    %rdx, %rax
        jne     .L3

So it uses fma. However, it doest not vectorize the loop (The s in sd means single (i.e. not packed) and the d means double floating point).

To vectorize the loop you need to enable associative math e.g. with -Ofast. Using -Ofast -mavx2 -mfma gives

.L8:
        vmovupd (%rax,%rsi), %xmm2
        addq    $1, %r10
        vinsertf128     $0x1, 16(%rax,%rsi), %ymm2, %ymm2
        vfmadd231pd     (%r12,%rsi), %ymm2, %ymm1
        addq    $32, %rsi
        cmpq    %r10, %rdi
        ja      .L8

So now it's vectorized (pd means packed doubles). However, it's not unrolled. This is currently a limitation of GCC. You need to unroll several times due to the dependency chain. If you want to have the compiler do this for you then consider using Clang which unrolls four times otherwise unroll by hand with intrinsics.

Note that unlike GCC, Clang does not use fma by default with -mfma. In order to use fma with Clang use -ffp-contract=fast (e.g. -O3 -mfma -ffp-contract=fast) or #pragma STDC FP_CONTRACT ON or enable associative math with e.g. -Ofast You're going to want to enable associate math anyway if you want to vectorize the loop with Clang.

See Fused multiply add and default rounding modes and https://stackoverflow.com/a/34461738/2542702 for more info about enabling fma with different compilers.


GCC creates a lot of extra code to handle misalignment and for N not a multiples of 8. You can tell the compiler to assume the arrays are aligned using __builtin_assume_aligned and that N is a multiple of 8 using N & -8

The following code with -Ofast -mavx2 -mfma

double foo2(double * __restrict X, double * __restrict Y, int N) {
    X = (double*)__builtin_assume_aligned(X,32);
    Y = (double*)__builtin_assume_aligned(Y,32);
    double sum = 0.0;
    for (int i = 0; i < (N &-8); ++i) sum += X[i] * Y[i];
    return sum;
}

produces the following simple assembly

        andl    $-8, %edx
        jle     .L4
        subl    $4, %edx
        vxorpd  %xmm0, %xmm0, %xmm0
        shrl    $2, %edx
        xorl    %ecx, %ecx
        leal    1(%rdx), %eax
        xorl    %edx, %edx
.L3:
        vmovapd (%rsi,%rdx), %ymm2
        addl    $1, %ecx
        vfmadd231pd     (%rdi,%rdx), %ymm2, %ymm0
        addq    $32, %rdx
        cmpl    %eax, %ecx
        jb      .L3
        vhaddpd %ymm0, %ymm0, %ymm0
        vperm2f128      $1, %ymm0, %ymm0, %ymm1
        vaddpd  %ymm1, %ymm0, %ymm0
        vzeroupper
        ret
.L4:
        vxorpd  %xmm0, %xmm0, %xmm0
        ret

Upvotes: 3

yb303
yb303

Reputation: 1439

I'm not sure this will get you all the way there, but I'm almost sure that a big part of the solution.

You have to break the loop into two: 0 to N, with step M>1. I'd try with M of 16, 8, 4, and look at the asm. And a inner loop of 0 to M. Don't worry about the math iterator math. Gcc is smart enough with it.

Gcc should unroll the inner loop and them it can SIMD it and maybe use FMA.

Upvotes: 0

Related Questions