Reputation: 6430
I have the following loop:
float* s;
float* ap;
float* bp;
... // initialize s, ap, bp
for (size_t i = 0; i < 64; ++i) {
    s[i] = ap[i] + bp[i];
}
Seems like a good candidate for vectorization. Though I have optimization turned on, when I look at the assembly output, clang (I'm using Xcode) seems to not have vectorized the loop:
LBB33_1: ## =>This Inner Loop Header: Depth=1
movss (%rax,%rsi,4), %xmm0 ## xmm0 = mem[0],zero,zero,zero
addss (%rcx,%rsi,4), %xmm0
movss %xmm0, (%rdx,%rsi,4)
Ltmp353:
incq %rsi
Ltmp354:
cmpq $64, %rsi
Ltmp355:
jne LBB33_1
How can I get clang/Xcode to vectorize this simple loop?
Upvotes: 6
Views: 7135
Reputation: 365737
Use a non-ancient version of clang/LLVM. Apple clang/LLVM is different from mainline clang/LLVM, but they share a common codebase.
Mainline clang3.3 and newer auto-vectorize your loop at -O3. Clang3.4 and newer auto-vectorize it even at -O2.
Without restrict, clang does emit asm that checks for overlap between the destination and the two sources (with a fallback to scalar), so you'll get more efficient asm from float *restrict s.
#include <stdlib.h>
void add_float_good(float *restrict s, float *restrict ap, float *restrict bp)
{
    for (size_t i = 0; i < 64; ++i) {
        s[i] = ap[i] + bp[i];
    }
}
This compiles with clang3.4 -O3 (on the Godbolt compiler explorer) to the simplistic asm below, with the worst of indexed addressing modes and loop overhead, but at least it's vectorized. Newer clang likes to unroll, especially when tuning for recent Intel (e.g. -march=skylake).
# clang3.4 -O3
add_float_good:
xor eax, eax
.LBB0_1: # %vector.body
movups xmm0, xmmword ptr [rsi + 4*rax]
movups xmm1, xmmword ptr [rdx + 4*rax]
addps xmm1, xmm0
movups xmmword ptr [rdi + 4*rax], xmm1
add rax, 4
cmp rax, 64
jne .LBB0_1
ret
Notice that without AVX, it can't use a memory-source operand for addps because there's no compile-time alignment guarantee.
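If your buffers really are 16-byte aligned, you can tell the compiler so. The following is only a sketch, not something from the asm above: add_float_aligned is a hypothetical name, and __builtin_assume_aligned is a GCC/Clang extension rather than standard C. Whether a given clang version then folds one load into addps is something to check in the asm output rather than assume.

```c
#include <stddef.h>

/* Hypothetical variant of add_float_good. The caller promises that all three
 * pointers are 16-byte aligned; passing a misaligned pointer is undefined
 * behaviour. */
void add_float_aligned(float *restrict s, const float *restrict ap,
                       const float *restrict bp)
{
    /* GCC/Clang extension: returns its argument, annotated as 16-byte aligned. */
    float *sa = __builtin_assume_aligned(s, 16);
    const float *aa = __builtin_assume_aligned(ap, 16);
    const float *ba = __builtin_assume_aligned(bp, 16);

    for (size_t i = 0; i < 64; ++i) {
        sa[i] = aa[i] + ba[i];
    }
}
```

The caller has to provide the alignment, for example with _Alignas(16) on the arrays or with aligned_alloc.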
clang8.0 -O3 -march=skylake fully unrolls with YMM vectors, like gcc with the same options.
Upvotes: 15
Reputation: 1488
It’s probably best to make this explicit, using Accelerate. In this case, vDSP_vadd will do the trick.
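A minimal sketch of what that could look like for the loop in the question. The wrapper name add_float_vdsp and the non-Apple fallback are my own additions for illustration; on macOS you would also need to link with -framework Accelerate.

```c
#include <stddef.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical wrapper: s[i] = ap[i] + bp[i] for i in [0, n). */
void add_float_vdsp(float *s, const float *ap, const float *bp, size_t n)
{
#ifdef __APPLE__
    /* vDSP_vadd(A, strideA, B, strideB, C, strideC, N): C[i] = A[i] + B[i].
     * A stride of 1 means contiguous elements. */
    vDSP_vadd(ap, 1, bp, 1, s, 1, (vDSP_Length)n);
#else
    /* Plain-loop fallback so the sketch still builds off Apple platforms. */
    for (size_t i = 0; i < n; ++i) {
        s[i] = ap[i] + bp[i];
    }
#endif
}
```

For the 64-element case you would call add_float_vdsp(s, ap, bp, 64).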
Upvotes: 1