Reputation: 25
I have the following code, which takes as input an x and fills the output vector y, and which I would like to vectorize with OpenMP's simd directive:
for (size_t qi = 0; qi < nq; ++qi){
    const auto ii = face_indices[qi*3], jj = face_indices[qi*3+1], kk = face_indices[qi*3+2];
    y[ii] += x[kk] * fpl[ii] * fq[qi] * fpr[kk] - x[jj] * fpl[ii] * fq[qi] * fpr[jj];
    y[kk] += x[ii] * fpl[kk] * fq[qi] * fpr[ii] - x[jj] * fpl[kk] * fq[qi] * fpr[jj];
    y[jj] -= x[ii] * fpl[jj] * fq[qi] * fpr[ii] + x[kk] * fpl[jj] * fq[qi] * fpr[kk];
}
Aside from y, all of the input arrays (fq, fpl, fpr, face_indices) are const / read-only. For context, x, y, fpl, and fpr are all of size np, whereas fq has size nq.
In practice, nq is on the order of hundreds of thousands (sometimes millions), and face_indices has size 3*nq, so I would like to vectorize the code with something like OpenMP's simd directive. For reference, the code implements a structured matrix-vector multiplication derived from a particular type of Laplacian operator.
I can get a version that compiles by adding the #pragma via something like:
#pragma omp simd
for (size_t qi = 0; qi < nq; ++qi){
...
}
The issue is that the resulting vectorization is wrong: it doesn't match the version compiled without the pragma. Here's a godbolt link giving a basic example of the behavior. Moreover, on much larger test cases, the speed-up is relatively small (only ~2-3x).
So I have two questions:
I was surprised when the behavior didn't match the non-simd version, because the loop is completely isolated / has no backward dependencies: the inputs on the rhs are all const, and the output y is the only thing being modified. Moreover, it's safe to assume ii, jj, and kk are all distinct within each iteration.
EDIT:
I've now realized that if the simd-length is $k$, it is indeed unsafe to issue multiple assignments to the same location in y --- that is, it's not enough to ensure the values ii, jj, and kk are distinct within an iteration; it looks like I need to ensure they are all distinct across each $k$-unrolled/vectorized group of iterations in face_indices.
So, for example, using the following input for face_indices is unsafe unless simdlen(1) is specified:
face_indices = { 0,1,4,0,2,5,0,3,7,1,3,8,2,3,10,4,7,8,5,7,10 }
In contrast, the following re-ordering works with simdlen(2):
face_indices = { 0,1,4,5,7,10,0,2,5,1,3,8,2,3,10,4,7,8,0,3,7 }
because every contiguous group of 6 indices (i.e. every pair of consecutive triples) consists of distinct values.
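Once such a reordering has been verified, the chosen width can be requested explicitly. A minimal sketch (note that simdlen is only a preference hint; strictly speaking, safelen(2) is the clause that promises the compiler no more than 2 iterations run concurrently):
#pragma omp simd simdlen(2)
for (size_t qi = 0; qi < nq; ++qi){
    // Safe only because every pair of consecutive triples in the reordered
    // face_indices is index-disjoint, so no element of y is written twice
    // within a 2-wide chunk.
    const auto ii = face_indices[qi*3], jj = face_indices[qi*3+1], kk = face_indices[qi*3+2];
    y[ii] += x[kk] * fpl[ii] * fq[qi] * fpr[kk] - x[jj] * fpl[ii] * fq[qi] * fpr[jj];
    y[kk] += x[ii] * fpl[kk] * fq[qi] * fpr[ii] - x[jj] * fpl[kk] * fq[qi] * fpr[jj];
    y[jj] -= x[ii] * fpl[jj] * fq[qi] * fpr[ii] + x[kk] * fpl[jj] * fq[qi] * fpr[kk];
}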
Upvotes: 0
Views: 480
Reputation: 50901
How to fix the code to make the simd version match the non-simd version?
The constraints implied by omp simd are AFAIK not clearly stated in the documentation, so it is dependent on the target platform (target architecture, compiler, OS). This is confirmed by this previous post. That being said, one general assumption is that the iterations of a SIMD chunk must all be fully independent. This means not only that all the ii, jj, and kk must be individually different within a chunk, but also that ii, jj and kk must not be the same for a given iteration. I strongly advise you to check that this holds, since it is the most probable explanation. The loop cannot be vectorized by most OpenMP implementations as long as this assumption is broken (it would not be efficient otherwise anyway).
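For reference, a minimal sketch of such a check (assuming size_t indices purely for illustration; adjust the element type of face_indices to whatever you actually use):
#include <algorithm>
#include <cstddef>
#include <unordered_set>

// Returns true if every chunk of `simdlen` consecutive iterations touches
// pairwise-distinct indices ii/jj/kk, i.e. no location of y is written twice
// within a chunk (and ii, jj, kk are also distinct inside each iteration).
bool chunks_are_independent(const std::size_t* face_indices, std::size_t nq,
                            std::size_t simdlen)
{
    for (std::size_t base = 0; base < nq; base += simdlen) {
        const std::size_t end = std::min(base + simdlen, nq);
        std::unordered_set<std::size_t> seen;
        for (std::size_t qi = base; qi < end; ++qi)
            for (int t = 0; t < 3; ++t)
                if (!seen.insert(face_indices[qi * 3 + t]).second)
                    return false;  // duplicate index found within the chunk
    }
    return true;
}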
Are there any obvious optimizations I can make to make the simd version faster?
The main issue is that this code is actually not SIMD-friendly.
First of all, a memory access like face_indices[qi*3] results in a non-contiguous SIMD load because of the *3 stride. Such accesses are generally significantly less efficient than contiguous ones. Compilers can sometimes generate not-too-horrible assembly code based on shuffles for this specific case, though (because the stride is small and known at compile time).
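One way to side-step the strided index loads is a structure-of-arrays layout (a sketch, assuming you are free to change the layout; face_ii/face_jj/face_kk are hypothetical split arrays, not something from the question):
// Hypothetical layout: face_ii[qi] == face_indices[qi*3],
// face_jj[qi] == face_indices[qi*3+1], face_kk[qi] == face_indices[qi*3+2].
// The index loads then become plain contiguous vector loads; the gathers on
// x/fpl/fpr discussed just below remain.
#pragma omp simd
for (size_t qi = 0; qi < nq; ++qi){
    const auto ii = face_ii[qi], jj = face_jj[qi], kk = face_kk[qi];
    // ... same body as the original loop ...
}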
Moreover, ii, jj and kk are runtime-defined indices that are not guaranteed to be uniform or linear. This means accesses like x[kk] are not contiguous either, and compilers cannot fall back on shuffle instructions in this case. In SSE (x86), they can only do slow scalar loads. In AVX2 (x86), there is a gather instruction designed for this use-case. However, it is generally not efficiently implemented on mainstream x86 processors (it is faster than using scalar loads to fill SIMD registers, but the processor still does independent scalar loads internally). On Neon (ARM), there is AFAIK no gather instruction. On SVE (ARM), there is a gather instruction, but very few processors implement SVE so far. Since your code is almost certainly limited mainly by the memory access pattern, I do not expect it to be much faster using SIMD instructions.
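To make that concrete, a gathered load is roughly the following (a sketch using AVX2 intrinsics, assuming float data and 32-bit indices purely for illustration; the question does not state either type):
#include <immintrin.h>
#include <cstdint>

// Load 8 indices kk[0..7] and gather x[kk[i]] for each lane. Despite being a
// single instruction, the hardware still performs 8 more-or-less independent
// loads internally on most current x86 CPUs.
static inline __m256 gather8(const float* x, const std::int32_t* kk) {
    const __m256i idx = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(kk));
    return _mm256_i32gather_ps(x, idx, /*scale=*/4);  // scale 4 == sizeof(float)
}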
Additionally, in the provided godbolt link, the instruction set is unspecified. This means SSE2 is used by default (since the target architecture is x86-64). SSE2 is rather obsolete nowadays. It has been superseded by AVX/AVX2 and more recently by AVX-512 (only supported by quite recent x86-64 processors or server ones -- >=IceLake-Client, >=Skylake-Server and >=Zen4). You can use flags like -march=native (not portable) or -mavx2. Note that Clang does not use gather instructions with AVX/AVX2 but does with AVX-512.
On top of that, you can factorize operations like fpl[ii] * fq[qi], fpl[kk] * fq[qi], and fpl[jj] * fq[qi], which are each computed twice. You might think that compilers should be able to do that automatically, but floating-point math is not associative, so they cannot do that *. Following the written order of operations, compilers need to compute tmp = x[kk] * fpl[ii] first and then tmp * fq[qi]. This is sub-optimal since x[kk] * fpl[ii] is not a sub-expression that is reused anywhere, while the factor fpl[ii] * fq[qi] that is reused cannot be formed without re-association (the same applies to the other sub-expressions).
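A sketch of the manual factorization (done by hand it changes the rounding slightly compared to the original expression, which is exactly why the compiler will not do it on its own; double and size_t are assumed here, match your actual types):
#include <cstddef>

// Same computation as the original loop, with the twice-used products hoisted
// into temporaries.
void matvec(const std::size_t* face_indices, const double* x, double* y,
            const double* fpl, const double* fpr, const double* fq, std::size_t nq)
{
    #pragma omp simd
    for (std::size_t qi = 0; qi < nq; ++qi){
        const auto ii = face_indices[qi*3], jj = face_indices[qi*3+1], kk = face_indices[qi*3+2];
        const double a  = fpl[ii] * fq[qi];
        const double b  = fpl[kk] * fq[qi];
        const double c  = fpl[jj] * fq[qi];
        const double xi = x[ii] * fpr[ii];   // these three are also each reused twice
        const double xj = x[jj] * fpr[jj];
        const double xk = x[kk] * fpr[kk];
        y[ii] += a * (xk - xj);
        y[kk] += b * (xi - xj);
        y[jj] -= c * (xi + xk);
    }
}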
To conclude, unless you can modify the code so as to make ii, jj, kk uniform or linear (at least one of them), this code cannot be efficiently vectorized on mainstream platforms. Changing the layout of face_indices so as to make the accesses more SIMD-friendly should help a bit, but I do not expect a big performance improvement from (only) this.
* Technically, the OpenMP SIMD directive can be sufficient for the target implementation to ignore floating-point associativity rules, since this is required for the loop to be vectorized anyway, but I do not expect compilers to do so in this use-case. You can use the compiler flag -ffast-math to avoid this problem, assuming this flag is safe in your case (e.g. it is dangerous when dealing with NaN or Inf values). See this related post and this one.
Note that on GPUs, this code can be made more efficient (assuming iterations are actually fully independent). Indeed, some GPUs (like Nvidia ones) can do fast gathers in a special kind of memory called shared memory (generally a few dozen KiB per streaming multiprocessor). Memory accesses are only fast if there are not too many bank conflicts, though.
Upvotes: 6