justHelloWorld

Reputation: 6828

Is this a good practice for vectorization?

I'm trying to improve the performance of this code by vectorizing the following function:

inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}

As far as I know, you can only vectorize loops that involve exactly one math operation. The code above has 5 math operations, so this (using OpenMP):

#pragma omp simd
for( int k = 0; k < n; k++ )
    d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;

isn't going to work. However, I was wondering whether breaking the loop above into multiple loops with exactly one math operation each is good practice for vectorization. The resulting code would be:

double p0[n], p3[n], p1[n], p2[n];
#pragma omp simd
for( int k = 0; k < n; k++ )
    p0[k] = origin[f[k].p0]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p3[k] = origin[f[k].p3]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p1[k] = origin[f[k].p1]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p2[k] = origin[f[k].p2]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p0[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p1[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p2[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p3[k];

Is this a good solution, or is there a better one? Will modern compilers (say, gcc) do this kind of optimization (or better) by themselves (e.g. with -O3 enabled), so that there is actually no gain in performance?

Upvotes: 2

Views: 104

Answers (1)

zam

Reputation: 1684

This is generally bad HPC programming practice, for several reasons:

  1. These days you normally have to make your code as computationally dense as possible. To achieve that, you want the highest Arithmetic Intensity (AI) you can get for the loop. For simplicity, think of AI as the ratio of [amount of computation] to [number of bytes moved to/from memory to perform that computation]. By splitting the loop you make the AI of each resulting loop much lower, because you no longer reuse the same bytes for different computations.
  2. A (vector- or thread-) parallel reduction (in your case over the variable "d") has a cost/overhead which you don't want to multiply by 8 or 10 (i.e. by the number of loops you produced by hand).
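For contrast, a reduction does not require splitting at all: OpenMP's reduction clause lets the original fused loop vectorize as one unit. A minimal sketch, assuming OpenMP 4.0+ and the SurfHF field layout implied by the question's indexing:

```cpp
// Field layout assumed from the question (p0..p3 indices, weight w).
struct SurfHF { int p0, p1, p2, p3; float w; };

inline float calcHaarPatternSimd(const int* origin, const SurfHF* f, int n)
{
    double d = 0;
    // One loop, one reduction clause: the compiler keeps vector-lane
    // partial sums and performs the final horizontal add itself.
    #pragma omp simd reduction(+:d)
    for (int k = 0; k < n; k++)
        d += (origin[f[k].p0] + origin[f[k].p3]
            - origin[f[k].p1] - origin[f[k].p2]) * f[k].w;
    return (float)d;
}
```

The gathers through f[k].p0..p3 may still limit how much the vectorizer can do, but the reduction itself is no longer an obstacle.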
  3. In many cases the Intel/GCC compiler vectorization engines can optimize slightly better when several data fields of the same object are processed in the same loop body, as opposed to split loops.

There are a few theoretical advantages to loop splitting as well, but they don't apply to your case; I list them just in case. Loop splitting is reasonable/profitable when:

  • There is more than one loop-carried dependency or reduction in the same loop.
  • Splitting the loop helps to get rid of some out-of-order-execution data-flow dependencies.
  • You do not have enough vector registers to perform all the computations (for complicated algorithms).
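As a hypothetical illustration of the first bullet (the names and the function are made up, not from the question): a loop computing two independent reductions can be split so that each resulting loop carries exactly one.

```cpp
// Hypothetical example: sum and sum-of-squares are independent
// reductions; after splitting, each loop carries exactly one.
void sumAndSumSq(const double* a, int n, double& sum, double& sumSq)
{
    sum = 0;
    #pragma omp simd reduction(+:sum)
    for (int k = 0; k < n; k++)
        sum += a[k];

    sumSq = 0;
    #pragma omp simd reduction(+:sumSq)
    for (int k = 0; k < n; k++)
        sumSq += a[k] * a[k];
}
```

Note this only pays off when the combined body is register-starved or the dependencies interfere; otherwise the fused form has the better Arithmetic Intensity, as argued above.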

Intel Advisor (which you mentioned in previous questions) helps to analyze many of these factors and measures AI.

It is also true that a good compiler "doesn't care" whether you write one such loop or the split version, because it could easily transform one form into the other on the fly. However, the applicability of this kind of transformation is very limited in real codes, because to perform it the compiler has to know a lot of extra information at compile time: whether pointers or dynamic arrays overlap, whether data is aligned, etc. So you shouldn't rely on compiler transformations or on a specific minor compiler version; just write HPC-ready code as well as you can.
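One way to hand the compiler some of that missing aliasing information yourself is the __restrict qualifier (non-standard, but supported by GCC, Clang and the Intel compilers). A sketch with hypothetical names:

```cpp
// __restrict promises the compiler that out[] and in[] never overlap,
// removing one common obstacle to automatic vectorization of the loop.
void scaleArray(float* __restrict out, const float* __restrict in,
                float w, int n)
{
    for (int k = 0; k < n; k++)
        out[k] = in[k] * w;
}
```

With the promise in place, the compiler can vectorize without emitting a runtime overlap check; the promise is yours to keep, though — passing overlapping pointers is undefined behavior.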

Upvotes: 1
