Auto-vectorization of scalar product in loop

Question

I am trying to auto-vectorize the following loop nest, in which the i- and j-loops run over the lower triangle of a matrix. Unfortunately, according to the vectorization report, the compiler does not vectorize (i.e. translate to AVX SIMD instructions) the j- and k-loops. I think vectorization should be straightforward, because there is no pointer aliasing (#pragma ivdep and the compiler option -D NOALIAS) and the data (x and p, both 1D arrays) is aligned to 64 bytes.

The if-statement might be the problem, but even with the branch-free alternative (an expensive shift that extracts the sign of a double) the report does not show this loop as vectorized.

__assume_aligned(x, 64);
__assume_aligned(p, 64);
#pragma omp parallel for simd reduction(+:accum)
for ( int i = 1 ; i < N ; i++ ){ // loop over lower triangle (i,j), OpenMP SIMD LOOP WAS VECTORIZED
    for ( int j = 0 ; j < i ; j++ ){ // <-- remark #25460: No loop optimizations reported
        double __attribute__((aligned(64))) scalarp = 0.0;
        #pragma omp simd
        for ( int k=0 ; k < D ; k++ ){ // <-- remark #25460: No loop optimizations reported
            // scalar product of \sum_k x_{i,k} \cdot x_{j,k}
            scalarp += x[i*D + k] * x[j*D + k];
        }

        // Alternative to following if:
        // accum +=  - ( (long long) floor( - ( scalarp + p[i] + p[j] ) ) >> 63);
        #pragma ivdep
        if ( scalarp + p[i] + p[j] >= 0 ){ // check if condition is satisfied
            accum += 1;
        }
    }
}

Is this related to the fact that the starting index for each OpenMP thread is not known until run time? I thought the simd clause resolves this and that Intel's auto-vectorizer is aware of it.

Intel Compiler: 18.0.2 20180210

Edit: I've looked at the assembly, and it is now clear that the code is already vectorized. Sorry for bothering all of you.

boraas · Accepted Answer

Looking at the assembly really helps. The code is already vectorized: in this particular case, the "OpenMP SIMD LOOP WAS VECTORIZED" remark also covers the inner loop.
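For anyone who wants to repeat the check, here is a minimal sketch of the same kernel as a standalone function, so the generated assembly is small enough to read. The function name count_pairs and its signature are my own, not from the question. The sketch also makes two changes to the original code: it adds a reduction clause for scalarp (as far as I know, the simd construct needs one for an accumulator), and it replaces the if-statement with a comparison, which evaluates to 0 or 1 in C.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions, not from the
   question): counts the pairs (i, j) with j < i for which
   sum_k x[i*D+k] * x[j*D+k] + p[i] + p[j] >= 0. */
long count_pairs(const double *x, const double *p, int N, int D)
{
    long accum = 0;

    for (int i = 1; i < N; i++) {          /* rows of the lower triangle */
        for (int j = 0; j < i; j++) {      /* columns left of the diagonal */
            double scalarp = 0.0;

            /* The reduction clause tells the compiler that scalarp is an
               accumulator, so partial sums may be kept in SIMD lanes. */
            #pragma omp simd reduction(+:scalarp)
            for (int k = 0; k < D; k++) {
                scalarp += x[(size_t)i * D + k] * x[(size_t)j * D + k];
            }

            /* A relational expression yields 0 or 1, so no branch is needed. */
            accum += (scalarp + p[i] + p[j] >= 0.0);
        }
    }
    return accum;
}
```

Compiling this with -S (plus -qopenmp and an AVX target option on the Intel compiler, or -fopenmp-simd on GCC) writes the assembly to a file. Packed instructions such as vmulpd and vaddpd on ymm registers in the k-loop indicate that it was vectorized; only scalar vmulsd and vaddsd would indicate that it was not.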
