Reputation: 5570
I'm trying to use OpenMP to split a for loop computation to multiple threads. Additionally, I'm trying to instruct the compiler to vectorize each chunk assigned to each thread. The code is the following:
#pragma omp for private(i)
__pragma(loop(ivdep))
for (i = 0; i < 4096; i++)
vC[i] = vA[i] + SCALAR * vB[i];
The problem is that both pragmas expect the for loop right after.
Is there any smart construct to make this work?
Some might argue that due to the for loop splitting with OpenMP, the vectorization of the loop won't work. However I read that the #pragma omp for divides the loop into a number of contiguous chunks equal to the thread count. Is thitt?
Upvotes: 0
Views: 407
Reputation: 9489
What about using #pragma omp for simd private(i)
instead of the pragma + __pragma() ?
Edit: since OpenMP 4 doesn't seem to be an option for you, you can manually split your loop to get rid of the #pragma omp for
by just computing the index limits by hand using omp_get_num_threads()
and omp_get_thread_num()
, and then keep the ivdep
for the per-thread loop.
Edit 2: since I'm a nice guy and since this is boilerplate (more common when programming in MPI, but still) but quite annoying to get right when you do it for the first time, here is a possible solution:
#pragma omp parallel
{
int n = 4096;
int tid = omp_get_thread_num();
int nth = omp_get_num_threads();
int chunk = n / nth;
int beg = tid * chunk + min( tid, n % nth );
int end = ( tid + 1 ) * chunk + min( tid + 1, n % nth );
#pragma ivdep
for ( int i = beg; i < end; i++ ) {
vC[i] = vA[i] + SCALAR * vB[i];
}
}
Upvotes: 2