Reputation: 361
I wrote some C++ code to solve a linear system A.x = b, where A is a symmetric matrix, by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x = A^-1.b = V.D^-1.V^T.b, where of course V is orthogonal.
Now I would like to optimize this last operation as much as possible, e.g. by using (C)BLAS routines and OpenMP.
Here is my naive implementation:
// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i=0; i<N; i++)
    {
        for (int j=0; j<N; j++)
        {
            for (int k=0; k<N; k++)
            {
                X[i] += B[j] * V[i+k*N] * V[j+k*N] / D[k];
            }
        }
    }
}
All arrays are C-style arrays, where V is of size N^2, D is of size N, B is of size N and X is of size N (and initialized with zeros).
For now, this naive implementation is very slow and is the bottleneck of the code. Any hints and help would be much appreciated!
Thanks
EDIT: Thanks to Jérôme Richard's answer and comment, I further optimized his solution by calling BLAS and parallelizing the middle loop with OpenMP. On a 1000x1000 matrix, this version is ~4 times faster than his proposal, which itself was 1000 times faster than my naive implementation.
I found the #pragma omp parallel for simd clause to be faster than the other alternatives on two different machines with 4 and 20 cores respectively, for N=1000 and N=2000.
#include <cblas.h>

void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    double* sum = new double[N]{0.};
    // sum = V^T.B
    cblas_dgemv(CblasColMajor, CblasTrans, N, N, 1., V, N, B, 1, 0., sum, 1);
    // sum = D^-1.V^T.B
    #pragma omp parallel for simd
    for (int i=0; i<N; ++i)
    {
        sum[i] /= D[i];
    }
    // X = V.D^-1.V^T.B
    cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, 1., V, N, sum, 1, 0., X, 1);
    delete [] sum;
}
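For reference, here is a minimal sketch of how this routine could be driven from the diagonalization step mentioned above (an illustration only: it assumes LAPACKE with column-major storage, that lapack_int is compatible with int, and omits error checking; example is just a hypothetical wrapper):
#include <lapacke.h>
#include <vector>

// Illustrative driver (not from the original post): diagonalize A with
// LAPACKE_dsyev and feed the results to solve().
void example(int N, std::vector<double>& A /* N*N symmetric, column-major */, const std::vector<double>& B)
{
    std::vector<double> D(N), X(N);
    // A = V.D.V^T: on exit A holds the eigenvectors V and D the eigenvalues.
    LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', N, A.data(), N, D.data());
    // X = V.D^-1.V^T.B
    solve(A.data(), D.data(), B.data(), X.data(), N);
}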
Upvotes: 1
Views: 341
Reputation: 50358
This code is currently highly memory-bound, so the resulting program will probably scale poorly (as long as compiler optimizations are enabled). Indeed, on most common systems (e.g. a 1-socket non-NUMA processor) the RAM throughput is a resource that is both shared between cores and scarce. Moreover, the memory access pattern is inefficient and the algorithmic complexity of the code can be improved.
To speed up the computation, the j and k loops can be swapped so that V is read contiguously. With this order, V[i+k*N] and the division by D[k] become constants in the innermost loop. The computation can then be factorized to be much faster, since B[j] and V[j+k*N] do not depend on i either: the inner sum over j can be precomputed once per k. The resulting algorithm runs in O(n^2) rather than O(n^3) thanks to this precomputation!
Finally, omp simd can be used to help compilers vectorize the code, making it even faster!
Note that the _OPENMP guard seems useless here since #pragma omp directives are ignored by compilers when OpenMP is disabled or not supported.
#include <vector>

// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    std::vector<double> kSum(N);

    // Precompute kSum[k] = (sum_j B[j]*V[j+k*N]) / D[k]
    #pragma omp parallel for
    for (int k=0; k<N; k++)
    {
        double sum = 0.0;
        #pragma omp simd reduction(+:sum)
        for (int j=0; j<N; j++)
        {
            sum += B[j] * V[j+k*N];
        }
        kSum[k] = sum / D[k];
    }

    // Loop tiling can be used to speed up this section even more.
    // The idea is to swap the i-based and k-based loops and work on thread-private copies
    // of X and finally sum the thread-private versions into a global X
    // (see the sketch after this function).
    // The resulting code should work on contiguous data and can even be vectorized.
    #pragma omp parallel for
    for (int i=0; i<N; i++)
    {
        double sum = X[i];
        for (int k=0; k<N; k++)
        {
            sum += kSum[k] * V[i+k*N];
        }
        X[i] = sum;
    }
}
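For illustration, one possible sketch of that loop-swap/thread-private idea, replacing only the second loop nest above (untested; accumulate_x is a hypothetical helper, not part of the original answer):
#include <vector>

// Hypothetical sketch: X[i] += sum_k kSum[k] * V[i+k*N], with k as the outer
// loop so that each column of V is read contiguously. Each thread accumulates
// into a private buffer, which is then merged into X.
void accumulate_x(const double* V, const double* kSum, double* X, const int N)
{
    #pragma omp parallel
    {
        std::vector<double> priv(N, 0.0);      // thread-private partial result
        #pragma omp for nowait
        for (int k=0; k<N; k++)
        {
            const double c = kSum[k];
            #pragma omp simd
            for (int i=0; i<N; i++)
                priv[i] += c * V[i+k*N];       // contiguous reads of column k
        }
        #pragma omp critical                   // merge partial results into X
        for (int i=0; i<N; i++)
            X[i] += priv[i];
    }
}
Whether this actually beats the simple parallel-for over i depends on N and on the target machine, so it is worth benchmarking both variants.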
The new code should be several orders of magnitude faster than the original one (but still memory-bound). Note that results might be a bit different (as floating-point operations are not really associative), but I expect the results to be more accurate.
Upvotes: 2