Duloren

Reputation: 2711

Performance drops when running Eigen dense matrix multiplications over multiple cores on ARM / Raspberry PI

I have found a significant drop in performance when running Eigen dense matrix multiplications on 2 or 3 threads in parallel on 32-bit or 64-bit ARM on a Raspberry Pi 4.

I can't understand this issue because the RPI 4 has 4 cores and theoretically can handle up to 4 threads in real parallelism. Furthermore, I cannot reproduce the problem on my laptop (4-core Intel i9 processor), where each thread keeps the same performance regardless of whether I am running one, two or three threads in parallel.

In my experiments (see this repo for details), I'm running different threads on a 4-core Raspberry Pi 4 running Buster. The problem occurs on both the 32-bit and 64-bit versions. In order to illustrate the issue, I wrote a program where each thread holds its own data and then performs dense matrix multiplications using that data, so each thread acts as a totally independent unit of processing:

void worker(const std::string & id) {

  // Each thread holds its own private A, B and C; nothing is shared between threads.
  const MatrixXd A = 10 * MatrixXd::Random(size, size);
  const MatrixXd B = 10 * MatrixXd::Random(size, size);
  MatrixXd C;
  double test = 0;

  // 30 timed steps, each one a cycle of 100 multiplications (see foo below).
  for (int step = 0; step < 30; ++step) {
      test += foo(A, B, C);
  }

  std::cout << "test value is:" << test << "\n";

}

where foo is simply a loop with 100 matrix multiplication calls:

const int size = 512;

// One cycle of 100 dense multiplications. One element of C is accumulated into
// the result so that each product is actually used.
float foo(const MatrixXd &A, const MatrixXd &B, MatrixXd &C) {
  float result = 0.0;
  for (int i = 0; i < 100; ++i)
  {
      // noalias() avoids the temporary Eigen would otherwise create for A * B.
      C.noalias() = A * B;

      result += C(0, 0);
  }
  return result;
}

Using the chrono library, I have found that each step of the thread loop:

test += foo(A, B, C);

takes nearly 9,000 ms when I am running only a single thread:

int main(int argc, char ** argv)
{

    Eigen::initParallel(); // make Eigen safe to use from multiple threads

    std::cout << Eigen::nbThreads() << " eigen threads\n";

    std::thread t0(worker, "t-0");
    t0.join();

    return 0;
}
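
The per-step times were measured with std::chrono. A minimal sketch (assumed; the exact instrumentation lives in the repo) of how the loop inside worker can be timed:

#include <chrono>
#include <iostream>

// ... inside worker(), with A, B, C, test and id as defined above ...
for (int step = 0; step < 30; ++step) {
    const auto start = std::chrono::steady_clock::now();
    test += foo(A, B, C);
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << id << " step " << step << ": " << elapsed.count() << " ms\n";
}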

The problem comes up when I try to run 2 or more threads in parallel:

std::thread t0(worker, "t-0");
std::thread t1(worker, "t-1");

t0.join();
t1.join();

By my measurements (the detailed results can be found in the mentioned repository), when I am running two threads in parallel, each cycle of 100 multiplications takes 11,000 ms or more. When I am running 3 threads the performance is much worse (~23,000 ms).

In the same experiment on my laptop (Ubuntu 20.04 64-bit, 4-core Intel i9-9900K), the performance of each thread is nearly the same (~1,600 ms) whether I run one, two or three threads.

The code I am using in this experiment, along with compilation instructions, can be found in this repo: https://github.com/doleron/eigen3-multithread-arm-issue

EDIT regarding @Surt's answer:

In order to check @Surt's hypothesis, I performed some slightly different experiments.

  1. Running 100 cycles of 100,000 multiplications of 16x16 matrices. The results can be seen in the following chart:

[chart: results for 16x16 matrices]

  2. Running 100 cycles of 100,000 multiplications of 64x64 matrices. The results can be seen in the following chart:

[chart: results for 64x64 matrices]

By my count, the total cache footprint of the 9 matrices of size 64x64 is:

64 x 64 x sizeof(double) x 9 = 294,912 bytes

This amount of memory represents 28.1% of the 1 MiB cache, which leaves some room for whatever else the processor needs to keep cached. The 9 matrices are 3 per thread, namely A, B and C. Note that I'm using C.noalias() = A * B; to avoid a temporary matrix for A * B.

  3. Running 100 cycles of 100,000 multiplications of 128x128 matrices. The results can be seen in the following chart:

[chart: results for 128x128 matrices]

The expected footprint of nine 128x128 matrices is 1,179,648 bytes, more than 112% of the total available cache. So this last scenario is likely to hit the processor's cache bottleneck.
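
For reference, the working-set arithmetic above can be reproduced with a few lines (illustrative only; the 1 MiB figure is the Pi 4's shared L2 cache):

#include <cstdio>

int main() {
    const double l2_bytes = 1024.0 * 1024.0;  // 1 MiB shared L2 on the Pi 4
    for (int n : {16, 64, 128}) {
        // 9 matrices in total: A, B and C for each of the 3 threads.
        const double bytes = 9.0 * n * n * sizeof(double);
        std::printf("%3dx%-3d -> %9.0f bytes (%5.1f%% of L2)\n",
                    n, n, bytes, 100.0 * bytes / l2_bytes);
    }
    return 0;
}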

I think the results shown in the previous charts confirm @Surt's hypothesis, so I will accept that answer. Looking at the charts carefully, it is possible to see a slight difference between the single-thread scenario and the scenarios with 2 or more threads when the matrix size is 16 or 64. I think that is due to general OS scheduler overhead.

Upvotes: 0

Views: 802

Answers (1)

Surt

Reputation: 16089

Basically your Pi has a much smaller cache than your desktop PC's CPU, which means your program runs into cache misses much more often on the Pi than on the PC.

Each 512x512 MatrixXd is 512 * 512 * 8 bytes (sizeof(double)), roughly 2 MiB, and each thread works on three of them.

On your PC, with perhaps 12 MB of L3 cache, you will never have to go to RAM (except maybe on allocation), while the tiny cache on the Pi will be blown away.

The Raspberry Pi 4 uses a Broadcom BCM2711 SoC with a 1.5 GHz 64-bit quad-core ARM Cortex-A72 processor, with 1 MiB shared L2 cache.

So the Pi will spend a lot of time pulling data in from RAM.
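
For concreteness, a tiny check of the footprint described above (illustrative; MatrixXd stores doubles):

#include <cstdio>

int main() {
    const double per_matrix = 512.0 * 512.0 * sizeof(double);  // one 512x512 MatrixXd
    const double per_thread = 3.0 * per_matrix;                // A, B and C
    // Prints roughly 2.0 MiB per matrix and 6.0 MiB per thread,
    // against the Pi 4's 1 MiB shared L2.
    std::printf("per matrix: %.1f MiB, per thread: %.1f MiB\n",
                per_matrix / (1024.0 * 1024.0), per_thread / (1024.0 * 1024.0));
    return 0;
}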

If, on the other hand, you had split the same work between the different threads, you might have seen improved performance (or at least no decrease); some blocking scheme might even have leveraged the L1 cache (on both machines). One possible split is sketched below.
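
Here is an illustrative sketch (my own, not code from the question's repo; the thread count and row-block scheme are arbitrary) of splitting a single product C = A * B across threads, so A and B are shared read-only and each thread writes a disjoint block of rows of C:

#include <Eigen/Dense>
#include <thread>
#include <vector>
#include <iostream>

using Eigen::MatrixXd;

int main() {
    const int size = 512;      // matrix dimension, as in the question
    const int n_threads = 2;   // hypothetical thread count; must divide size here

    const MatrixXd A = 10 * MatrixXd::Random(size, size);
    const MatrixXd B = 10 * MatrixXd::Random(size, size);
    MatrixXd C(size, size);

    const int rows_per_thread = size / n_threads;

    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t) {
        pool.emplace_back([&, t] {
            const int row0 = t * rows_per_thread;
            // Each thread computes a disjoint horizontal slice of C,
            // reading the shared A and B (read-only, so no data race).
            C.middleRows(row0, rows_per_thread).noalias() =
                A.middleRows(row0, rows_per_thread) * B;
        });
    }
    for (auto &th : pool) th.join();

    std::cout << "C(0, 0) = " << C(0, 0) << "\n";
    return 0;
}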

Upvotes: 1
