Tim

Reputation: 83

Eigen C++: Performance of sparse-matrix manipulations

Can anybody explain the following behavior of Eigen sparse matrices? I have been looking into aliasing and lazy evaluation, but I haven't been able to improve things. Tech specs: I am using the latest stable Eigen release on Ubuntu 16.10, compiled with g++ and no optimization flags.

Suppose I define a simple identity in the following way:

SparseMatrix<double> spIdent(N,N);
spIdent.reserve(N);
spIdent.setIdentity();

then perform these operations with it

spIdent-spIdent;
spIdent*spIdent;
spIdent - spIdent*spIdent;

and measure the computation times for all three. What I get is this

0 Computation time: 2.6e-05
1 Computation time: 2e-06 
2 Computation time: 1.10706

So each operation on its own is fast, but the combination is extremely slow. The noalias() method is only defined for dense matrices, and in my dense example it did not make much of a difference anyway. Any enlightenment?

MCVE:

#include <iostream>
#include <ctime>
#include "../Eigen/Sparse"

using namespace std;
using namespace Eigen;

int main() {

unsigned int N=2000000;

SparseMatrix<double> spIdent(N,N);
spIdent.reserve(N);
spIdent.setIdentity();

clock_t start=clock();
spIdent*spIdent;
cout << "0 Computation time: " << float(clock() - start)/1e6 << '\n';

start=clock();
spIdent-spIdent;
cout << "1 Computation time: " << float(clock() - start)/1e6 << '\n';

start=clock();
spIdent - (spIdent*spIdent);
cout << "2 Computation time: " << float(clock() - start)/1e6 << '\n';

return 0;

}

Upvotes: 3

Views: 1569

Answers (3)

Avi Ginsburg

Reputation: 10596

It's not so much that it gets optimized out; rather, the lazy evaluation is, well, very lazy. Look at the product. The code that's called is (at least in whatever version of Eigen is installed on this machine):

template<typename Derived>
template<typename OtherDerived>
inline const typename SparseSparseProductReturnType<Derived,OtherDerived>::Type
SparseMatrixBase<Derived>::operator*(const SparseMatrixBase<OtherDerived> &other) const
{
  return typename SparseSparseProductReturnType<Derived,OtherDerived>::Type(derived(), other.derived());
}

which returns an expression of the product (i.e. lazy). Nothing is done with this expression, so the cost is zero. The same goes for the difference. Now, when doing a - a*a, the a*a is an expression. It then meets operator-, which sees an expression on its right-hand side. That expression is then evaluated to a temporary (i.e. it costs time) so that it can be used in operator-. Why evaluate to a temporary? Read this for their logic (search for "The second circumstance").
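A quick way to check this is to force each expression to actually be materialized, e.g. with eval() (a minimal sketch reusing spIdent and the includes from the question's MCVE, and assuming SparseMatrixBase::eval() is available in this Eigen version):

clock_t start = clock();
(spIdent * spIdent).eval();   // eval() materializes the product expression into a temporary SparseMatrix
cout << "product, evaluated: " << float(clock() - start)/1e6 << '\n';

start = clock();
(spIdent - spIdent).eval();   // same for the difference
cout << "difference, evaluated: " << float(clock() - start)/1e6 << '\n';

With the expressions actually evaluated, the first two timings should land in the same ballpark as the combined expression.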

operator- is a CwiseBinaryOp with the product expression as the right hand side. The first thing CwiseBinaryOp does is assign the right hand side to a member:

EIGEN_STRONG_INLINE CwiseBinaryOp(const Lhs& aLhs, const Rhs& aRhs, const BinaryOp& func = BinaryOp())
      : m_lhs(aLhs), m_rhs(aRhs), m_functor(func)

(m_rhs(aRhs)), which in turn calls a SparseMatrix constructor:

/** Constructs a sparse matrix from the sparse expression \a other */
template<typename OtherDerived>
inline SparseMatrix(const SparseMatrixBase<OtherDerived>& other)
  : m_outerSize(0), m_innerSize(0), m_outerIndex(0), m_innerNonZeros(0)
{
  ...
  *this = other.derived();
}

which in turn calls operator=, which (somebody correct me if I'm wrong) will always trigger an evaluation, in this case into a temporary.
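Consistent with that, timing just the conversion of the product expression into a SparseMatrix (a small sketch, again reusing spIdent from the question) should account for most of the ~1.1 s measured in case 2:

clock_t start = clock();
SparseMatrix<double> tmp(spIdent * spIdent);   // goes through the converting constructor quoted above, hence operator=
cout << "conversion alone: " << float(clock() - start)/1e6 << '\n';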

Upvotes: 4

AlQuemist

Reputation: 1304

I think, as @hfhc2 has mentioned, the first two statements in the code get optimized away entirely by the compiler (since their results are not used anywhere else). In the third statement an auxiliary intermediate variable is most likely created to store the temporary result of spIdent*spIdent. To see this clearly, consider the following example, which uses explicit copy-assignments:

#include <iostream>
#include <ctime>
#include <Eigen/Sparse>

using namespace std;
using namespace Eigen;

int main () {

   const unsigned int N = 2000000;

   SparseMatrix<double> spIdent(N,N);
   SparseMatrix<double> a(N,N), b(N,N), c(N,N);

   spIdent.reserve(N);
   spIdent.setIdentity();

   clock_t start = clock();
   a = spIdent*spIdent;
   cout << "0 Computation time: " << float(clock() - start)/1e6 << endl;

   start = clock();
   b = spIdent-spIdent;
   cout << "1 Computation time: " << float(clock() - start)/1e6 << endl;

   start = clock();
   c = a - b;
   cout << "2 Computation time: " << float(clock() - start)/1e6 << endl;

   return 0;

} 

The measured times (without compiler optimizations) on openSUSE 12.2 (x86_64), g++ 4.7.1, with an Intel 2-core 2 GHz CPU are:

0 Computation time: 1.58737
1 Computation time: 0.417798
2 Computation time: 0.428174

which seems quite reasonable.

Upvotes: 3

hfhc2

Reputation: 4391

Well, as others have mentioned, in the first two statements the code gets optimized away entirely (I tested with a current version of g++ and -O3). The disassembly shows this for the second statement:

  400e78:   e8 03 fe ff ff          callq  400c80 <clock@plt>   # timing begins
  400e7d:   48 89 c5                mov    %rax,%rbp
  400e80:   e8 fb fd ff ff          callq  400c80 <clock@plt>   # timing ends

For the third part something actually happens and Eigen library code is called:

  400ede:   e8 9d fd ff ff          callq  400c80 <clock@plt>   # timing begins
  400ee3:   48 89 c5                mov    %rax,%rbp
  400ee6:   8b 44 24 58             mov    0x58(%rsp),%eax
  400eea:   39 44 24 54             cmp    %eax,0x54(%rsp)
  400eee:   c6 44 24 20 00          movb   $0x0,0x20(%rsp)
  400ef3:   48 89 5c 24 28          mov    %rbx,0x28(%rsp)
  400ef8:   48 89 5c 24 30          mov    %rbx,0x30(%rsp)
  400efd:   48 c7 44 24 38 00 00    movq   $0x0,0x38(%rsp)
  400f04:   00 00 
  400f06:   c6 44 24 40 01          movb   $0x1,0x40(%rsp)
  400f0b:   0f 85 99 00 00 00       jne    400faa <main+0x22a>
  400f11:   48 8d 4c 24 1f          lea    0x1f(%rsp),%rcx
  400f16:   48 8d 54 24 20          lea    0x20(%rsp),%rdx
  400f1b:   48 8d bc 24 90 00 00    lea    0x90(%rsp),%rdi
  400f22:   00 
  400f23:   48 89 de                mov    %rbx,%rsi
  400f26:   e8 25 1a 00 00          callq  402950 <_ZN5Eigen13CwiseBinaryOpINS_8internal20scalar_difference_opIdEEKNS_12SparseMatrixIdLi0EiEEKNS_19SparseSparseProductIRS6_S8_EEEC1ES8_RSA_RKS3_>
  400f2b:   48 8d bc 24 a0 00 00    lea    0xa0(%rsp),%rdi
  400f32:   00 
  400f33:   e8 18 02 00 00          callq  401150 <_ZN5Eigen12SparseMatrixIdLi0EiED1Ev>
  400f38:   e8 43 fd ff ff          callq  400c80 <clock@plt>   # timing ends

I guess that in this case the compiler is not able to figure out that the result of the computation is not used, as opposed to the first two cases.

If you take a look at the documentation, you can see that a simple operation like + on sparse matrices does not return a matrix but a CwiseBinaryOp expression representing the result. I figure that if you never use this expression object anywhere, the resulting matrix is never constructed.
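As a minimal illustration (reusing spIdent from the question), the expression object by itself costs next to nothing; only binding it to a SparseMatrix builds the result:

spIdent + spIdent;                            // builds a CwiseBinaryOp expression and discards it: almost free
SparseMatrix<double> sum = spIdent + spIdent; // the assignment evaluates the expression: this is where the work happens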

Upvotes: 4
