Reputation: 27
I tried the following code to test the SIMD directive in OpenMP.
#include <iostream>
#include <sys/time.h>
#include <cmath>
#define N 4096
#define M 1000
using namespace std;
int main()
{
    timeval start, end;
    float a[N], b[N];
    // Fill the input array.
    for (int i = 0; i < N; i++)
        b[i] = i;
    gettimeofday(&start, NULL);
    // Repeat the kernel M times so the run is long enough to measure.
    for (int j = 0; j < M; j++)
    {
        #pragma omp simd
        for (int i = 0; i < N; i++)
            a[i] = pow(b[i], 2.1);
    }
    gettimeofday(&end, NULL);
    // Elapsed wall-clock time in microseconds.
    int time_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
    cout << "time_used=" << time_used << endl;
    return 0;
}
But whether I compile it with
g++ -fopenmp simd.cpp
or
g++ simd.cpp
the reported values of "time_used" are almost the same. It looks like the SIMD directive I used has no effect? Thanks!
Additional question: I replaced
a[i]=pow(b[i],2.1);
with
a[i]=b[i]+2.1;
When I compile it with
g++ -fopenmp simd.cpp
the output of "time_used" is about 12000. When I compile it with
g++ simd.cpp
the output of "time_used" is also about 12000, almost the same.
My computer: Haswell i5, 8 GB RAM, Ubuntu Kylin 16.04, GCC 5.4.0
Upvotes: 0
Views: 1250
Reputation: 244772
The compiler generally can't auto-vectorize a call to an opaque library function such as pow. It can only vectorize arithmetic operations that map onto SIMD instructions, or calls for which a SIMD variant of the function is available.
Therefore, you need a vector math library that implements the pow function using SIMD instructions. Intel provides one. I'm not sure if pow is one of the functions that it offers with vector optimizations, but I imagine it is. You should also beware that Intel's math library may not be optimal on AMD processors.
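To illustrate the difference between an opaque call and one the compiler can see into, here is a minimal sketch. The names cube_plus_one and apply_cube_plus_one are hypothetical and are not from the question; the polynomial is only a stand-in, not a replacement for pow. The assumption is that a function whose body is visible and which is marked with "#pragma omp declare simd" gives the compiler what it needs to produce a vector version; whether GCC actually does so depends on the version and on flags such as -O2 -fopenmp-simd.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): a cheap polynomial whose body
// is visible to the compiler. "declare simd" asks for a SIMD variant of it.
#pragma omp declare simd
inline float cube_plus_one(float x)
{
    return x * x * x + 1.0f;
}

// Applies cube_plus_one to every element of `in` and returns the results.
std::vector<float> apply_cube_plus_one(const std::vector<float>& in)
{
    std::vector<float> out(in.size());
    const float* src = in.data();
    float* dst = out.data();
    const std::size_t n = in.size();

    // The pragma is ignored if OpenMP support is not enabled at compile time;
    // the loop still computes the same values, just possibly without SIMD.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = cube_plus_one(src[i]);

    return out;
}
```

With GCC, adding -fopt-info-vec to the command line should report which loops were vectorized, which is a more direct check than comparing timings.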
You claim that you tried changing the pow function call to a simple addition, but didn't see any improvement in the results. I'm not quite sure how that is possible, because if you change the inner loop from:
a[i]=pow(b[i],2.1);
to, say:
a[i] += b[i];
or:
a[i] += (b[i] * 2);
then GCC, with optimizations enabled, notices that you never use the result and elides the entire thing. It was unable to perform this optimization with the pow function call, because it didn't know whether the function had any other side-effects. However, with code that is visible to the optimizer, it can…well, optimize it. In some cases, it might be able to vectorize it. In this case, it was able to remove it entirely.
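One way to keep the optimizer from discarding the loop is to make its result observable. The following is a rough sketch, not the asker's code: BenchResult and run_add_benchmark are hypothetical names, and std::chrono is swapped in for gettimeofday. It sums the output array after the timed region, so the additions have to be carried out in some form.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
struct BenchResult {
    double checksum;         // sum of the output array; using it keeps the work alive
    long long microseconds;  // elapsed wall-clock time of the timed region
};

// Runs a[i] += b[i] over n elements, `repeats` times, with b[i] = i.
BenchResult run_add_benchmark(std::size_t n, int repeats)
{
    std::vector<float> a(n, 0.0f), b(n);
    for (std::size_t i = 0; i < n; ++i)
        b[i] = static_cast<float>(i);

    float* pa = a.data();
    const float* pb = b.data();

    auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < repeats; ++j) {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            pa[i] += pb[i];
    }
    auto stop = std::chrono::steady_clock::now();

    // Consume the output so the compiler cannot treat the loops as dead code.
    double checksum = 0.0;
    for (float v : a)
        checksum += v;

    long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    return BenchResult{checksum, us};
}
```

Printing the checksum from main (or otherwise using it) is what matters here. The compiler may still transform the loops, but it can no longer delete them, so timings taken with and without vectorization become comparable.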
If you tried code where the optimizer removed this loop entirely, and you still didn't see an improvement on your benchmark scores, then clearly this is not a bottleneck in your code and you needn't worry about trying to vectorize it.
Upvotes: 1