Reputation: 361
I wrote some C++ code to solve a linear system A.x = b, where A is a symmetric matrix, by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x = A^-1.b = V.D^-1.V^T.b, where of course V is orthogonal.
Now I would like to optimize this last operation as much as possible, e.g. by using (C)BLAS routines and OpenMP.
Here is my naive implementation:
// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i=0; i<N; i++)
    {
        for (int j=0; j<N; j++)
        {
            for (int k=0; k<N; k++)
            {
                X[i] += B[j] * V[i+k*N] * V[j+k*N] / D[k];
            }
        }
    }
}
All arrays are C-style arrays, where V is of size N^2, D is of size N, B is of size N and X is of size N (and initialized with zeros).
For now, this naive implementation is very slow and is the bottleneck of the code. Any hints and help would be much appreciated!
Thanks
EDIT: Thanks to Jérôme Richard's answer and comment, I further optimized his solution by calling BLAS and parallelizing the middle loop with OpenMP. On a 1000x1000 matrix, this version is ~4 times faster than his proposal, which itself was 1000 times faster than my naive implementation.
I found the #pragma omp parallel for simd clause to be faster than the other alternatives on two different machines with 4 and 20 cores respectively, for N=1000 and N=2000.
#include <cblas.h>

void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    double* sum = new double[N]{0.};
    // sum = V^T.B
    cblas_dgemv(CblasColMajor, CblasTrans, N, N, 1., V, N, B, 1, 0., sum, 1);
    // sum = D^-1.V^T.B
    #pragma omp parallel for simd
    for (int i=0; i<N; ++i)
    {
        sum[i] /= D[i];
    }
    // X = V.D^-1.V^T.B
    cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, 1., V, N, sum, 1, 0., X, 1);
    delete [] sum;
}
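For reference, here is a minimal sketch of how this routine could be driven from the diagonalization step mentioned above (an illustration only: it assumes LAPACKE with column-major storage, that lapack_int is compatible with int, and omits error checking; example is just a hypothetical wrapper):
#include <lapacke.h>
#include <vector>

// Illustrative driver (not from the original post): diagonalize A with
// LAPACKE_dsyev and feed the results to solve().
void example(int N, std::vector<double>& A /* N*N symmetric, column-major */, const std::vector<double>& B)
{
    std::vector<double> D(N), X(N);
    // A = V.D.V^T: on exit A holds the eigenvectors V and D the eigenvalues.
    LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', N, A.data(), N, D.data());
    // X = V.D^-1.V^T.B
    solve(A.data(), D.data(), B.data(), X.data(), N);
}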
Upvotes: 1
Views: 341
Reputation: 50358
This code is currently highly memory-bound, so the resulting program will probably scale poorly (as long as compiler optimizations are enabled). Indeed, on most common systems (e.g. a 1-socket non-NUMA processor) the RAM throughput is a resource that is both shared between cores and scarce. Moreover, the memory access pattern is inefficient and the algorithmic complexity of the code can be improved.
To speed up the computation, the j and k loops can be swapped so that V is read contiguously. With this order, V[i+k*N] and the division by D[k] become constants in the innermost loop. The computation can then be factorized to be much faster, since B[j] and V[j+k*N] do not depend on i either: the inner sum over j can be precomputed once per k. The resulting algorithm runs in O(n^2) rather than O(n^3) thanks to this precomputation!
Finally, omp simd can be used to help compilers vectorize the code, making it even faster!
Note that the _OPENMP guard seems useless here since #pragma omp directives are ignored by compilers when OpenMP is disabled or not supported.
#include <vector>

// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    std::vector<double> kSum(N);

    // Precompute kSum[k] = (sum_j B[j]*V[j+k*N]) / D[k]
    #pragma omp parallel for
    for (int k=0; k<N; k++)
    {
        double sum = 0.0;
        #pragma omp simd reduction(+:sum)
        for (int j=0; j<N; j++)
        {
            sum += B[j] * V[j+k*N];
        }
        kSum[k] = sum / D[k];
    }

    // Loop tiling can be used to speed up this section even more.
    // The idea is to swap the i-based and k-based loops and work on thread-private copies
    // of X and finally sum the thread-private versions into a global X
    // (see the sketch after this function).
    // The resulting code should work on contiguous data and can even be vectorized.
    #pragma omp parallel for
    for (int i=0; i<N; i++)
    {
        double sum = X[i];
        for (int k=0; k<N; k++)
        {
            sum += kSum[k] * V[i+k*N];
        }
        X[i] = sum;
    }
}
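For illustration, one possible sketch of that loop-swap/thread-private idea, replacing only the second loop nest above (untested; accumulate_x is a hypothetical helper, not part of the original answer):
#include <vector>

// Hypothetical sketch: X[i] += sum_k kSum[k] * V[i+k*N], with k as the outer
// loop so that each column of V is read contiguously. Each thread accumulates
// into a private buffer, which is then merged into X.
void accumulate_x(const double* V, const double* kSum, double* X, const int N)
{
    #pragma omp parallel
    {
        std::vector<double> priv(N, 0.0);      // thread-private partial result
        #pragma omp for nowait
        for (int k=0; k<N; k++)
        {
            const double c = kSum[k];
            #pragma omp simd
            for (int i=0; i<N; i++)
                priv[i] += c * V[i+k*N];       // contiguous reads of column k
        }
        #pragma omp critical                   // merge partial results into X
        for (int i=0; i<N; i++)
            X[i] += priv[i];
    }
}
Whether this actually beats the simple parallel-for over i depends on N and on the target machine, so it is worth benchmarking both variants.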
The new code should be several orders of magnitude faster than the original one (but still memory-bound). Note that results might be a bit different (as floating-point operations are not really associative), but I expect the results to be more accurate.
Upvotes: 2