Reputation: 1744
I'm working on an Intel Xeon E5 (6 cores, 12 threads) with the Intel compiler and OpenMP 4.0.
Why does this piece of code run faster with #pragma omp simd than with #pragma omp parallel for simd?
for (int suppv = 0; suppv < sSize; suppv++) {
    Value *gptr = &grid[gind];
    const Value *cptr = &C[cind];
    #pragma omp simd // vs. #pragma omp parallel for simd
    for (int suppu = 0; suppu < sSize; suppu++)
        gptr[suppu] += d * cptr[suppu];
    gind += gSize;
    cind += sSize;
}
And it gets slower as I add more threads.
Edit 1:
* grid is a 4096×4096 matrix, stored as vector<complex<double>>
* C is a 2112×129×129 matrix, stored as vector<complex<double>>
* gSize = 4096
* sSize = 129
Timer: I take the difference of the return values of the POSIX times() API. (Its return value is elapsed real time in clock ticks, so it does measure wall-clock time under concurrency; I checked.)
E5 thread 1 SIMD takes: 291.520000 (s)
Upvotes: 0
Views: 2223
Reputation: 1060
If sSize = 129, as you state in your edit, then the overhead of parallelizing the loop doesn't pay off. This would be easier to confirm if you showed us the numbers for the sequential implementation (no SIMD) and for the pure parallel implementation (i.e. with #pragma omp parallel for but no SIMD).
What is likely happening is that even the pure parallel version is slower than the sequential one. Not only is the trip count small, but you also launch/create a parallel region for every iteration of the outermost loop, paying the fork/join overhead sSize times.
As for the SIMD version, this problem is essentially tailored to it: you have a highly vectorizable kernel whose trip count is too small to be worth distributing among threads.
Upvotes: 5