Reputation: 6828
I'm trying to improve performance from this code by vectorizing this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
return (float)d;
}
From my knowledge, you can vectorize loops that involves exactly one math operation. In the code above we have 5 math operations, so (using OMP):
#pragma omp simd
for( int k = 0; k < n; k++ )
d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
Isn't gonna work. However, I was thinking if breaking the loop above into multiple loops with exactly one math operation is a good practice for vectorization? The resulting code would be:
double p0[n], p3[n], p1[n], p2[n];
#pragma omp simd
for( int k = 0; k < n; k++ )
p0[k] = origin[f[k].p0]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
p3[k] = origin[f[k].p3]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
p1[k] = origin[f[k].p1]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
p2[k] = origin[f[k].p2]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
d += p0[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
d -= p1[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
d -= p2[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
d += p3[k];
Is this a good solution, or there is any better? Modern compilers (say gcc
) are going to do this (or better) kind of optimizations (e.g. enabling -O3
) by themselves (so there is actually no gain in performance)?
Upvotes: 2
Views: 104
Reputation: 1684
This is generally bad HPC programming practice by several reasons:
There are few theoretical advantages of loop splitting as well, but they don't apply to your case, so I provided them just in case. Loop splitting is reasonable/profitable when:
Intel Advisor (mentioned by you in prev questons) helps to analyze many of these factors and measures AI.
This is also true that good compilers "don't care" whenever you have one such loop or loop-split, because they could easily transform one case to another or vice versa on the fly. However the applicability for this kind of transformation is very limited in real codes, because in order to do that you have to know in compile-time a lot of extra info: whenever pointers or dynamic arrays overlap or not, whenever data is aligned or not etc etc. So you shouldn't rely on compiler transformations and specific compiler minor version, but just write HPC-ready code as much as you are capable to.
Upvotes: 1