Reputation: 85
I am trying to optimise some code using AVX intrinsics. A very simple test case compiles, but gcc reports that my loop was not vectorised, giving a number of reasons that I don't understand.
This is the full program, simple.c:
#include <math.h>
#include <stdlib.h>
#include <assert.h>
#include <immintrin.h>
int main(void)
{
    __m256 * x = (__m256 *) calloc(1024,sizeof(__m256));
    for (int j=0;j<32;j++)
        x[j] = _mm256_set1_ps(1.);
    return(0);
}
This is the command line: gcc simple.c -O1 -fopenmp -ffast-math -lm -mavx2 -ftree-vectorize -fopt-info-vec-missed
This is the output:
I have gcc version 5.4.
Can anyone help me to interpret these messages and to understand what is going on?
Upvotes: 3
Views: 605
Reputation: 364029
You're already manually vectorizing with intrinsics, so there's nothing left for gcc to auto-vectorize. This leads to uninteresting warnings, I assume from trying to auto-vectorize the intrinsic or the loop-counter increments.
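For contrast, here is a minimal sketch of the kind of loop the auto-vectorizer is actually meant for: plain scalar C with no intrinsics. The name set_to_1_scalar and its signature are made up for illustration and are not from the question. With -O3 (or -ftree-vectorize) and -mavx I would expect gcc to turn this into vector stores by itself, but I have not checked the output for every gcc version.

```c
#include <stddef.h>

/* Hypothetical scalar counterpart of the intrinsic loop: every element
   is written with an ordinary float store, so any vectorization here
   has to come from the compiler, not from the source code. */
void set_to_1_scalar(float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = 1.0f;
}
```

Running -fopt-info-vec-missed (or -fopt-info-vec) on a loop like this should give messages that are about your own loop rather than about the internals of an intrinsic.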
I get good asm from gcc 5.3 (on the Godbolt compiler explorer) if I don't do something silly like write a function that will optimize away, or try to compile it with only -O1.
#include <immintrin.h>
void set_to_1(__m256 * x) {
    for (int j=0;j<32;j++)
        x[j] = _mm256_set1_ps(1.);
}
push rbp
lea rax, [rdi+1024]
vmovaps ymm0, YMMWORD PTR .LC0[rip]
mov rbp, rsp
push r10 # gcc is weird with r10 in functions with ymm vectors
.L2: # this is the vector loop
vmovaps YMMWORD PTR [rdi], ymm0
add rdi, 32
cmp rdi, rax
jne .L2
vzeroupper
pop r10
pop rbp
ret
.LC0:
.long 1065353216
# ... repeated several times because gcc failed to use a vbroadcastss load or generate the constant on the fly
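The value 1065353216 in .LC0 is just the bit pattern of 1.0f (0x3F800000) printed as a decimal integer. A small sketch to check that, assuming float is a 32-bit IEEE-754 single; float_bits is a hypothetical helper name, not something from gcc or the question:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 32 bits of a float.
   memcpy is used to reinterpret the bytes without breaking
   the strict-aliasing rules. Assumes sizeof(float) == 4. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Calling float_bits(1.0f) should give the same integer that appears after .long in the listing.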
I do actually get nearly the same asm from -O1, but using -O1 to not optimize things away isn't a good way to see what gcc will really do.
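One thing to watch in the original main, separate from the vectorizer messages: storing through a __m256 * lets gcc use aligned stores (vmovaps, as in the loop above), and those need a 32-byte-aligned address. calloc is only guaranteed to return memory aligned for the fundamental types, which is not enough for __m256. A minimal sketch of one way around that, assuming a C11 library that provides aligned_alloc; calloc_aligned32 is a hypothetical helper name:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the original post: allocate `count`
   zeroed 32-byte blocks at a 32-byte-aligned address.
   Returns NULL on overflow or allocation failure; release with free(). */
void *calloc_aligned32(size_t count)
{
    if (count > SIZE_MAX / 32)
        return NULL;                    /* count * 32 would overflow */
    size_t bytes = count * 32;          /* always a multiple of the alignment */
    void *p = aligned_alloc(32, bytes); /* C11 */
    if (p != NULL)
        memset(p, 0, bytes);            /* mimic calloc's zero fill */
    return p;
}
```

In the question's program this could stand in for the calloc call, for example __m256 *x = calloc_aligned32(1024);, since sizeof(__m256) is 32. _mm_malloc / _mm_free would be another option.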
Upvotes: 3