Adam

Reputation: 1744

Why OpenMP 'simd' has better performance than 'parallel for simd'?

I'm working on an Intel E5 (6 cores, 12 threads) with the Intel compiler and OpenMP 4.0.

Why does this piece of code run faster with #pragma omp simd than with #pragma omp parallel for simd?

// grid, C: vector<complex<double>>; d: a complex<double> scaling factor;
// gind, cind: running offsets into grid and C (sizes given in Edit 1 below)
for (int suppv = 0; suppv < sSize; suppv++) {
  Value *gptr = &grid[gind];
  const Value *cptr = &C[cind];

  #pragma omp simd // vs. #pragma omp parallel for simd
  for (int suppu = 0; suppu < sSize; suppu++)
    gptr[suppu] += d * cptr[suppu];

  gind += gSize; // advance one row in grid
  cind += sSize; // advance one row in C
}

And with more threads, it becomes slower.


Edit 1:

* grid is a 4096×4096 matrix, stored as vector<complex<double>>
* C is a 2112×129×129 matrix, stored as vector<complex<double>>
* gSize = 4096
* sSize = 129

Upvotes: 0

Views: 2223

Answers (1)

a3mlord

Reputation: 1060

If sSize = 129, as you state in your edit, then the overhead of parallelizing the loop doesn't pay off. This would be easier to confirm if you showed us the timings of the sequential implementation (no SIMD) and of the pure parallel implementation (i.e. with #pragma omp parallel for but no SIMD).

What is likely happening is that even the pure parallel version is slower than the sequential one. Not only is the inner loop small (129 iterations leave each thread very little work), but you also create a parallel region on every iteration of the outermost loop, so you pay the thread-management overhead sSize times.

As for the SIMD version, this problem is essentially tailored for it: you have a highly vectorizable kernel that is simply too small to be worth distributing among threads.

Upvotes: 5
