Taylor

Reputation: 6430

How can I get clang to vectorize a simple loop?

I have the following loop:

float* s;
float* ap;
float* bp;

... // initialize s, ap, bp

for(size_t i=0;i<64;++i) {
   s[i] = ap[i]+bp[i];
}

This seems like a good candidate for vectorization. But even though I have optimization turned on, when I look at the assembly output, clang (I'm using Xcode) does not appear to have vectorized the loop:

LBB33_1:                                ## =>This Inner Loop Header: Depth=1
    movss   (%rax,%rsi,4), %xmm0    ## xmm0 = mem[0],zero,zero,zero
    addss   (%rcx,%rsi,4), %xmm0
    movss   %xmm0, (%rdx,%rsi,4)
Ltmp353:
    incq    %rsi
Ltmp354:
    cmpq    $64, %rsi
Ltmp355:
    jne LBB33_1

How can I get clang/Xcode to vectorize this simple loop?

Upvotes: 6

Views: 7135

Answers (2)

Peter Cordes

Reputation: 365737

Use a non-ancient version of clang/LLVM. Apple clang/LLVM uses different version numbers from mainline clang/LLVM, but they share a common codebase.

Mainline clang3.3 and newer auto-vectorize your loop at -O3. Clang3.4 and newer auto-vectorize it even at -O2.

Without restrict, clang does emit asm that checks for overlap between the destination and the two sources (with a fallback to scalar), so you'll get more efficient asm from float *restrict s.

#include <stdlib.h>
void add_float_good(float *restrict s, float *restrict ap, float *restrict bp)
{
    for(size_t i=0;i<64;++i) {
       s[i] = ap[i]+bp[i];
    }
}

compiles with clang3.4 -O3 (on the Godbolt compiler explorer) to this simplistic asm: indexed addressing modes and no unrolling mean more loop overhead than necessary, but at least it's vectorized. Newer clang likes to unroll, especially when tuning for recent Intel (e.g. -march=skylake).

# clang3.4 -O3
add_float_good:
        xor     eax, eax
.LBB0_1:                                # %vector.body
        movups  xmm0, xmmword ptr [rsi + 4*rax]
        movups  xmm1, xmmword ptr [rdx + 4*rax]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi + 4*rax], xmm1
        add     rax, 4
        cmp     rax, 64
        jne     .LBB0_1
        ret

Notice that without AVX, it can't fold a load into a memory-source operand for addps: legacy-SSE memory operands must be 16-byte aligned, and there's no compile-time alignment guarantee here.
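
The vectorized inner loop above is roughly what you'd write by hand with SSE intrinsics. A minimal sketch, assuming the same restrict-qualified 64-element arrays (the function name add_float_sse is made up for illustration):

#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

void add_float_sse(float *restrict s, float *restrict ap, float *restrict bp)
{
    for (size_t i = 0; i < 64; i += 4) {          /* 4 floats per XMM vector */
        __m128 a = _mm_loadu_ps(ap + i);          /* unaligned load, like movups */
        __m128 b = _mm_loadu_ps(bp + i);
        _mm_storeu_ps(s + i, _mm_add_ps(a, b));   /* addps, then movups store */
    }
}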

clang8.0 -O3 -march=skylake fully unrolls with YMM vectors, like gcc with the same options.

Upvotes: 15

Michael Tyson

Reputation: 1488

It’s probably best to make this explicit, using the Accelerate framework. In this case, vDSP_vadd will do the trick.
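
A minimal sketch of that call, assuming the 64-element buffers from the question (the function name add_float_vdsp is made up for illustration):

#include <Accelerate/Accelerate.h>

void add_float_vdsp(const float *ap, const float *bp, float *s)
{
    /* vDSP_vadd computes C[n] = A[n] + B[n] for n = 0..N-1;
       the 1s are the strides through each array. */
    vDSP_vadd(ap, 1, bp, 1, s, 1, 64);
}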

Upvotes: 1
