NMsd
NMsd

Reputation: 11

Eigen: Slow access to columns of Matrix 4

I am using Eigen for operations similar to Cholesky update, implying a lot of AXPY (sum plus multiplication by a scalar) on the columns of a fixed size matrix, typically a Matrix4d. In brief, it is 3 times more expensive to access to the columns of a Matrix 4 than to a Vector 4.

Typically, the code below:

for(int i=0;i<4;++i )  L.col(0) += x*y[i];

is 3 times less efficient than the code below:

for(int i=0;i<4;++i )  l4 += x*y[i];

where L is typically a matrix of size 4, x, y and l4 are vectors of size 4.

Moreover, the time spent in the first line of code is not depending on the matrix storage organization (either RowMajor of ColMajor).

On a Intel i7 (2.5GHz), it takes about 0.007us for vector operations, and 0.02us for matrix operations (timings are done by repeating 100000 times the same operation). My application would need thousands of such operation in timings hopefully far below the millisecond.

Question: I am doing something improperly when accessing columns of my 4x4 matrix? Is there something to do to make the first line of code more efficient?

Full code used for timings is below:

#include <iostream>
#include <Eigen/Core>
#include <vector>
#include <sys/time.h>

typedef Eigen::Matrix<double,4,1,Eigen::ColMajor> Vector4;
//typedef Eigen::Matrix<double,4,4,Eigen::RowMajor,4,4> Matrix4;
typedef Eigen::Matrix<double,4,4,Eigen::ColMajor,4,4> Matrix4;

inline double operator- (  const struct timeval & t1,const struct timeval & t0)
{
  /* TODO: double check the double conversion from long (on 64x). */
  return double(t1.tv_sec - t0.tv_sec)+1e-6*double(t1.tv_usec - t0.tv_usec);
}

void sumCols( Matrix4 & L,
              Vector4 & x4,
              Vector4 & y)
{
  for(int i=0;i<4;++i )
    {
      L.col(0) += x4*y[i];
    }
}

void sumVec( Vector4 & L,
             Vector4 & x4,
             Vector4 & y)
{
  for(int i=0;i<4;++i )
    {
      //L.tail(4-i)  += x4.tail(4-i)*y[i];
      L            += x4          *y[i];
    }
}

int main()
{
  using namespace Eigen;

  const int NBT = 1000000;

  struct timeval t0,t1;

  std::vector<     Vector4> x4s(NBT);  
  std::vector<     Vector4> y4s(NBT);  
  std::vector<     Vector4> z4s(NBT);  
  std::vector<     Matrix4> L4s(NBT);  

  for(int i=0;i<NBT;++i) 
  {
    x4s[i] = Vector4::Random();
    y4s[i] = Vector4::Random();
    L4s[i] = Matrix4::Random();
  }

  int    sample = int(z4s[55][2]/10*NBT);
  std::cout << "*** SAMPLE = " << sample << std::endl;

  gettimeofday(&t0,NULL);
  for(int i=0;i<NBT;++i) 
    {
      sumCols(L4s[i], x4s[i], y4s[i]);
    }
  gettimeofday(&t1,NULL);
  std::cout << (t1-t0) << std::endl;
  std::cout << "\t\t\t\t\t\t\tForce check" <<  L4s[sample](1,0) << std::endl;

  gettimeofday(&t0,NULL);
  for(int i=0;i<NBT;++i) 
    {
      sumVec(z4s[i], x4s[i], y4s[i]);
    }
  gettimeofday(&t1,NULL);
  std::cout << (t1-t0) << std::endl;
  std::cout << "\t\t\t\t\t\t\tForce check" <<  z4s[sample][2] << std::endl;

  return -1;
}

Upvotes: 1

Views: 310

Answers (1)

ggael
ggael

Reputation: 29205

As I said in a comment, the generated assembly is exactly the same for both functions.

The problem is that your benchmark is biased in the sense that L4s is 4 times bigger than z4s, and you thus get more cache misses in the matrix case than in the vector case.

Upvotes: 1

Related Questions