Reputation: 63
I am doing some calculations that include QR decomposition of a large number (~40000 per execution) of 4x4 matrices with complex double elements (drawn from a random distribution). I started by writing the code directly with Intel MKL functions, but after some research it seems that working with Eigen would be much simpler and result in easier-to-maintain code (partly because I find it difficult to work with 2D arrays in Intel MKL, and care is needed for memory management).
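For context, the per-matrix step I have in mind looks roughly like the sketch below (Matrix4cd and HouseholderQR are Eigen types; the Random() fill and the checksum are just stand-ins for my actual distribution and post-processing):

// Sketch of the intended per-matrix step: QR of many random 4x4 complex matrices.
#include <Eigen/Dense>
#include <complex>
#include <iostream>

int main()
{
    const int n_matrices = 40000;                        // roughly the count per execution
    std::complex<double> checksum(0.0, 0.0);
    for (int k = 0; k < n_matrices; ++k)
    {
        Eigen::Matrix4cd m = Eigen::Matrix4cd::Random(); // stand-in for my random distribution
        Eigen::HouseholderQR<Eigen::Matrix4cd> qr(m);    // QR of one 4x4 complex matrix
        Eigen::Matrix4cd q = qr.householderQ();          // Q factor; R is the upper triangle of matrixQR()
        checksum += q(0, 0);                             // touch the result so the loop isn't optimized away
    }
    std::cout << "checksum: " << checksum << std::endl;
}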
Before shifting to Eigen, I started with some performance checks. I took code (from an earlier similar question on SO) that multiplies a 10000x10000 matrix with a 10000x1000 matrix (sizes chosen large enough to show the effect of parallelization). I ran it on a 36-core node. When I checked the stats, Eigen without the Intel MKL directives (but compiled with -O3 -fopenmp) used all the cores and completed the task in ~7 s.
On the other hand, with
#define EIGEN_USE_MKL_ALL
#define EIGEN_VECTORIZE_SSE4_2
the code takes 28 s and uses only a single core. Here are my compilation commands:
g++ -m64 -std=c++17 -fPIC -c -I. -I/apps/IntelParallelStudio/mkl/include -O2 -DNDEBUG -Wall -Wno-unused-variable -O3 -fopenmp -I /home/bart/work/eigen-3.4.0 -o eigen_test.o eigen_test.cc
g++ -m64 -std=c++17 -fPIC -I. -I/apps/IntelParallelStudio/mkl/include -O2 -DNDEBUG -Wall -Wno-unused-variable -O3 -fopenmp -I /home/bart/work/eigen-3.4.0 eigen_test.o -o eigen_test -L/apps/IntelParallelStudio/linux/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_rt -lmkl_core -liomp5 -lpthread
The code is here:
//#define EIGEN_USE_MKL_ALL       // uncomment to route BLAS/LAPACK calls to MKL
//#define EIGEN_VECTORIZE_SSE4_2
#include <iostream>
#include <ctime>                  // clock(), clock_t, CLOCKS_PER_SEC
#include <Eigen/Dense>

using namespace Eigen;

int main()
{
    int n_a_rows = 10000;
    int n_a_cols = 10000;
    int n_b_rows = n_a_cols;
    int n_b_cols = 1000;

    MatrixXi a(n_a_rows, n_a_cols);
    for (int i = 0; i < n_a_rows; ++i)
        for (int j = 0; j < n_a_cols; ++j)
            a(i, j) = n_a_cols * i + j;

    MatrixXi b(n_b_rows, n_b_cols);
    for (int i = 0; i < n_b_rows; ++i)
        for (int j = 0; j < n_b_cols; ++j)
            b(i, j) = n_b_cols * i + j;

    MatrixXi d(n_a_rows, n_b_cols);

    clock_t begin = clock();
    d = a * b;
    clock_t end = clock();

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Time taken : " << elapsed_secs << std::endl;
}
In the previous question related to this topic, the difference in speed was found to be due to turbo boost (and the difference was not nearly as large). I know that for small matrices Eigen may work better than MKL, but I can't understand why Eigen+MKL refuses to use multiple cores even when I pass -liomp5 during linking.
Thank you in advance. (CentOS 7 with GCC 7.4.0, Eigen 3.4.0)
Upvotes: 1
Views: 1286
Reputation: 3477
Please set the following shell variables before executing your program:
export MKL_NUM_THREADS="$(nproc)"
export OMP_NUM_THREADS="$(nproc)"
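To confirm that both runtimes actually pick these variables up, you can query them at startup; a small sketch using mkl_get_max_threads() and Eigen::nbThreads() (compile with -fopenmp and link against MKL as below):

// Quick check that MKL and Eigen's own threading see the expected thread count.
#include <iostream>
#include <mkl.h>          // mkl_get_max_threads()
#include <Eigen/Core>     // Eigen::nbThreads()

int main()
{
    std::cout << "MKL threads   : " << mkl_get_max_threads() << std::endl;
    std::cout << "Eigen threads : " << Eigen::nbThreads() << std::endl;
}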
Also, the build commands below (the first $CXX line corresponds to a run with #define EIGEN_USE_MKL_ALL commented out)
. /opt/intel/oneapi/setvars.sh
$CXX -I /usr/local/include/eigen3 eigen.cpp -o eigen_test -lblas
$CXX -I /usr/local/include/eigen3 eigen.cpp -o eigen_test -lmkl_rt
work fine with CXX set to clang++, g++, or icpx. Setting the environment as shown above is important; in that case -lmkl_rt is plenty. A little bit of adjustment to the code gives the net benefit in wall-clock time:
#define EIGEN_USE_BLAS
#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <ctime>                  // clock(), clock_t, CLOCKS_PER_SEC
#include <chrono>
#include <Eigen/Dense>

using namespace Eigen;
using namespace std::chrono;

int main()
{
    int n_a_rows = 10000;
    int n_a_cols = 10000;
    int n_b_rows = n_a_cols;
    int n_b_cols = 1000;

    MatrixXd a(n_a_rows, n_a_cols);
    for (int i = 0; i < n_a_rows; ++i)
        for (int j = 0; j < n_a_cols; ++j)
            a(i, j) = n_a_cols * i + j;

    MatrixXd b(n_b_rows, n_b_cols);
    for (int i = 0; i < n_b_rows; ++i)
        for (int j = 0; j < n_b_cols; ++j)
            b(i, j) = n_b_cols * i + j;

    MatrixXd d(n_a_rows, n_b_cols);

    using wall_clock_t = std::chrono::high_resolution_clock;
    auto const start = wall_clock_t::now();
    clock_t begin = clock();
    d = a * b;
    clock_t end = clock();
    auto const wall = std::chrono::duration<double>(wall_clock_t::now() - start);

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "CPU time  : " << elapsed_secs << std::endl;
    std::cout << "Wall time : " << wall.count() << std::endl;
    std::cout << "Speed up  : " << elapsed_secs / wall.count() << std::endl;
}
The runtime on my i7-4790K (4 cores / 8 threads) @ 4 GHz shows near-perfect parallelisation:
With on-board BLAS:
CPU time : 12.5134
Wall time : 1.69036
Speed up : 7.40277
With MKL:
> ./eigen_test
CPU time : 11.4391
Wall time : 1.52542
Speed up : 7.49898
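If you would rather not depend on the environment, roughly the same effect can be pinned down in code; a sketch using mkl_set_num_threads() and Eigen::setNbThreads(), called before the product (omp_get_num_procs() here is just a stand-in for $(nproc)):

// Alternative to the exports: fix the thread counts programmatically at startup.
#include <mkl.h>          // mkl_set_num_threads()
#include <omp.h>          // omp_get_num_procs(); needs -fopenmp
#include <Eigen/Core>     // Eigen::setNbThreads()

int main()
{
    const int n = omp_get_num_procs();   // hardware thread count, like $(nproc)
    mkl_set_num_threads(n);              // threads used by MKL kernels
    Eigen::setNbThreads(n);              // threads used by Eigen's own GEMM (non-MKL path)
    // ... set up the matrices and run d = a * b as in the benchmark above ...
}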
Upvotes: 1