Reputation: 1744
I'm working on an Intel Xeon E5 (6 cores, 12 threads) with the Intel compiler and OpenMP 4.0.
Why does this piece of code run faster with #pragma omp simd than with #pragma omp parallel for simd?
for (int suppv = 0; suppv < sSize; suppv++) {
    Value *gptr = &grid[gind];
    const Value *cptr = &C[cind];
    #pragma omp simd // vs. #pragma omp parallel for simd
    for (int suppu = 0; suppu < sSize; suppu++)
        gptr[suppu] += d * cptr[suppu];
    gind += gSize;
    cind += sSize;
}
And it gets slower as I add more threads.
Edit 1:
* grid is a 4096×4096 matrix, stored as vector<complex<double>>
* C is a 2112×129×129 matrix, stored as vector<complex<double>>
* gSize = 4096
* sSize = 129
Timer: I take the difference of the return values of the POSIX times() API. (Its return value is elapsed real time in clock ticks, so it does measure wall-clock time under concurrency; I checked.)
E5 thread 1 SIMD takes: 291.520000 (s)
Upvotes: 0
Views: 2223
Reputation: 1060
If sSize = 129, as you state in your edit, then the overhead of parallelizing the loop doesn't pay off. This would be easier to confirm if you showed us the numbers for the sequential implementation (no SIMD) and for the pure parallel implementation (i.e. with #pragma omp parallel for but no SIMD).
What is likely happening is that even the pure parallel version is slower than the sequential one. Not only is the trip count small, but you also launch/create a parallel region for every iteration of the outermost loop, paying the fork/join overhead sSize times.
As for the SIMD version, this problem is essentially tailored to it: you have a highly vectorizable kernel whose trip count is too small to be worth distributing among threads.
Upvotes: 5