Reputation: 175
I have the following C++ code that multiplies the elements of two large arrays (of size count) and accumulates the sum:
double* pA1 = /* large array */;
double* pA2 = /* large array */;
// mm, count, and lg are defined elsewhere; the pointers walk backwards.
for (register int r = mm; r <= count; ++r)
{
    lg += *pA1-- * *pA2--;
}
Is there a way I can parallelize this code?
Upvotes: 0
Views: 515
Reputation: 50668
Here is an alternative OpenMP implementation that is simpler (and a bit faster on many-core platforms):
double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper; // the asker's timing helper, defined elsewhere
    double sum = 0.;
    // Each thread accumulates a private partial sum; OpenMP combines them at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];
    return sum;
}
GCC and ICC are able to vectorize this loop at -O3. Clang 13.0 fails to do so, even with -ffast-math, and even with explicit OpenMP SIMD directives or loop tiling. This appears to be a bug in Clang's optimizer related to OpenMP... Note that you can pass -mavx to use the AVX instruction set, which can be up to twice as fast as SSE (the default). It is available on almost all recent x86-64 PC processors.
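For reference, here is a minimal sketch of the SIMD variant mentioned above: omp parallel for simd is a standard OpenMP 4.0+ composite construct that requests both thread-level worksharing and vectorization of the loop. The function name and the compile line are only examples; adjust the flags for your compiler.
// Example compile line (GCC): g++ -O3 -fopenmp -mavx dot.cpp
double dot_prod_parallel_simd(const double* v1, const double* v2, int dim)
{
    double sum = 0.;
    // Distribute iterations across threads and vectorize each thread's chunk.
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];
    return sum;
}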
Upvotes: 2
Reputation: 175
I wanted to answer my own question. Looks like we can use OpenMP like the following. However, the speed gain is not that great (about 2x). My computer has 16 cores.
// need to use compile flag /openmp (MSVC); use -fopenmp with GCC/Clang
#include <cstdio>
#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper; // timing helper, defined elsewhere
    double sum = 0.;
    #pragma omp parallel shared(sum)
    {
        int num = omp_get_num_threads();
        int id = omp_get_thread_num();
        printf("I am thread #%d of %d.\n", id, num);
        // Each thread accumulates into its own private partial sum.
        double priv_sum = 0.;
        #pragma omp for
        for (int i = 0; i < dim; i++)
        {
            priv_sum += v1[i] * v2[i];
        }
        // Merge the partial sums one thread at a time.
        #pragma omp critical
        {
            cout << "priv_sum = " << priv_sum << endl;
            sum += priv_sum;
        }
    }
    return sum;
}
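A minimal usage sketch, assuming it is compiled together with the function above and that TimeMeasureHelper is defined as in the earlier snippets; the array size and values are made up for illustration:
#include <iostream>
#include <vector>

int main()
{
    const int dim = 1 << 20; // ~1M elements, arbitrary test size
    std::vector<double> a(dim, 1.0), b(dim, 2.0);
    double result = dot_prod_parallel(a.data(), b.data(), dim);
    std::cout << "dot product = " << result << std::endl; // expect 2 * dim
    return 0;
}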
Upvotes: 0