Reputation: 8430
I have the following C program (a simplification of my actual use case which exhibits the same behavior)
#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
const float * __restrict__ const input = malloc(20000*sizeof(float));
float * __restrict__ const output = malloc(20000*sizeof(float));
unsigned int pos=0;
while(1) {
unsigned int rest=100;
for(unsigned int i=pos;i<pos+rest; i++) {
output[i] = input[i] * 0.1;
}
pos+=rest;
if(pos>10000) {
break;
}
}
}
When I compile with
-O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native --std=c99 -fPIC -ffast-math
I get the output
main.c:10: note: not vectorized: unhandled data-ref
where 10 is the line of the inner for loop. When I looked up why it might say this, it seemed to be saying that the pointers could be aliased, but they can't be in my code, as I have the __restrict keyword. They also suggested including the -msse flags, but they don't seem to do anything either. Any help?
Upvotes: 5
Views: 6524
Reputation: 239011
It certainly seems like a bug. In the following, equivalent functions, foo()
is vectorised but bar()
is not, when compiling for an x86-64 target:
void foo(const float * restrict input, float * restrict output)
{
unsigned int pos;
for (pos = 0; pos < 10100; pos++)
output[pos] = input[pos] * 0.1;
}
void bar(const float * restrict input, float * restrict output)
{
unsigned int pos;
unsigned int i;
for (pos = 0; pos <= 10000; pos += 100)
for (i = 0; i < 100; i++)
output[pos + i] = input[pos + i] * 0.1;
}
Adding the -m32
flag, to compile for an x86 target instead, causes both functions to be vectorised.
Upvotes: 3
Reputation: 51445
try:
const float * __restrict__ input = ...;
float * __restrict__ output = ...;
experiment a bit by changing things around:
#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
const float * __restrict__ input = new float[20000];
float * __restrict__ output = new float[20000];
unsigned int pos=0;
while(1) {
unsigned int rest=100;
output += pos;
input += pos;
for(unsigned int i=0;i<rest; ++i) {
output[i] = input[i] * 0.1;
}
pos+=rest;
if(pos>10000) {
break;
}
}
}
g++ -O3 -g -Wall -ftree-vectorizer-verbose=7 -msse -msse2 -msse3 -c test.cpp
test.cpp:14: note: versioning for alias required: can't determine dependence between *D.4096_24 and *D.4095_21
test.cpp:14: note: mark for run-time aliasing test between *D.4096_24 and *D.4095_21
test.cpp:14: note: Alignment of access forced using versioning.
test.cpp:14: note: Vectorizing an unaligned access.
test.cpp:14: note: vect_model_load_cost: unaligned supported by hardware.
test.cpp:14: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 1 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: cost model: Adding cost of checks for loop versioning to treat misalignment.
test.cpp:14: note: cost model: Adding cost of checks for loop versioning aliasing.
test.cpp:14: note: Cost model analysis:
Vector inside of loop cost: 8
Vector outside of loop cost: 6
Scalar iteration cost: 5
Scalar outside cost: 1
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 2
test.cpp:14: note: Profitability threshold = 3
test.cpp:14: note: Vectorization may not be profitable.
test.cpp:14: note: create runtime check for data references *D.4096_24 and *D.4095_21
test.cpp:14: note: created 1 versioning for alias checks.
test.cpp:14: note: LOOP VECTORIZED.
test.cpp:4: note: vectorized 1 loops in function.
Compilation finished at Wed Feb 16 19:17:59
Upvotes: 1
Reputation: 93690
It doesn't like the outer loop format which is preventing it from understanding the inner loop. I can get it to vectorize if I just fold it into a single loop:
#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
const float * __restrict__ input = malloc(20000*sizeof(float));
float * __restrict__ output = malloc(20000*sizeof(float));
for(unsigned int i=0; i<=10100; i++) {
output[i] = input[i] * 0.1f;
}
}
(note that I didn't think too hard about how to properly translate the pos+rest limit into a single for loop condition, it may be wrong)
You might be able to take advantage of this by putting a simplified inner loop into a function which you call with pointers and a count. Even when it is inlined again it may work fine. This is assuming you deleted parts of your while()
loop that I have just simplified away but you need to retain.
Upvotes: 2