Reputation: 2147
Edit: My first code sample was wrong. Fixed with a simpler one.
I am implementing a C++ library for algebraic operations on large vectors and matrices. On x86-64 CPUs I found that OpenMP-parallel vector additions, dot products, etc. are not much faster than single-threaded versions: the parallel operations are only about -1% to 6% faster. I think this happens because of the memory bandwidth limitation.
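For example, an element-wise addition like the following barely speeds up for me (this is only a minimal sketch over plain arrays; the names a, b and n are illustrative, not my actual library code):

#include <cstddef>

// Sketch: a and b are plain double arrays of length n.
void add_vectors(double* a, const double* b, std::size_t n)
{
    // Each iteration loads 16 bytes and stores 8 bytes for a single
    // addition, so the loop saturates memory bandwidth long before it
    // saturates the cores; extra threads add very little.
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++)
        a[i] += b[i];
}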
So, the question is: is there a real performance benefit for code like this:
void DenseMatrix::identity()
{
    assert(height == width);
    // The element index is computed from y and x instead of a shared
    // running counter, so every iteration is independent and safe to
    // run in parallel.
    #pragma omp parallel for if (height > OPENMP_BREAK2)
    for (unsigned int y = 0; y < height; y++)
        for (unsigned int x = 0; x < width; x++)
            elements[y * width + x] = (x == y) ? 1 : 0;
}
In this sample there is no serious drawback to using OpenMP. But when I work with OpenMP on sparse vectors and sparse matrices, I cannot use, for instance, *.push_back(), and then the question becomes serious. (Elements of sparse vectors are not contiguous like those of dense vectors, so parallel programming has a drawback: result elements can arrive at any time, not in order from lower to higher index.)
Upvotes: 0
Views: 640
Reputation: 9070
I don't think this is a problem of memory bandwidth. I clearly see a problem with r: r is accessed from multiple threads, which causes both data races and false sharing. False sharing can dramatically hurt your performance.
I'm wondering whether you even get the correct answer, because there are data races on r. Did you get the correct answer?
However, the solution is very simple. The operation performed on r is a reduction, which can easily be expressed with OpenMP's reduction clause.
Try simply appending reduction(+ : r) after #pragma omp parallel.
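Since the edited question no longer shows the loop that uses r, here is a minimal sketch of what I mean; the array a and its length n are illustrative, assuming r accumulates a sum:

// Sketch only: a and n stand in for whatever your loop actually reads.
double r = 0.0;
#pragma omp parallel for reduction(+ : r)
for (size_t i = 0; i < n; i++)
    r += a[i];
// Each thread accumulates into its own private copy of r; the private
// copies are combined into the shared r when the loop finishes, so there
// is no data race and no false sharing on r.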
(Note: addition on double is not associative, so you may see some precision differences compared with the result of the serial code.)
Upvotes: 1