tovugike

Reputation: 21

fast single-precision matrix times vector product

Consider a matrix A of floats of size 48x16 and a vector of floats b of size 1x48.

Please suggest a way of calculating b×A as fast as possible on common desktop processors (i5/i7).

Background. The above product is in a tight loop, therefore its fast calculation is essential. At the moment I am working with the following naive algorithm:

inline void Critical(const float A[48][16], const float b[48], float x[16]) {
    // Note: assumes x is zero-initialized by the caller.
    for (int u = 0; u < 48; ++u) {
        for (int i = 0; i < 16; ++i) {
            x[i] += A[u][i] * b[u];
        }
    }
}

I have tried to offload the multiplication to MKL's SGEMV and later to SGEMM, but to no avail. The naive implementation still runs faster on an i7-4800MQ.
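(For reference, the SGEMV route is a call along these lines; this is a sketch of what I tried, not the exact code. With CblasTrans it computes x = Aᵀb, i.e. the 1x16 result of b×A:)

#include <mkl.h>

void CriticalMKL(const float A[48][16], const float b[48], float x[16]) {
    // x = 1.0 * A^T * b + 0.0 * x, with row-major A of size 48x16 (lda = 16).
    cblas_sgemv(CblasRowMajor, CblasTrans,
                48, 16,             // rows and columns of A
                1.0f, &A[0][0], 16, // alpha, A, lda
                b, 1,               // input vector and its stride
                0.0f, x, 1);        // beta, output vector and its stride
}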

EDIT1.

Eigen with static allocation is approximately as fast as the naive algorithm.

I've tried GCC 5, ICC and VC2015U3 with optimizations turned on (-O3, fast math, -mtune=native, etc.). ICC seems to produce the fastest code on both Linux and Windows.

EDIT2.

The elements of A are bounded in magnitude: max(|A_ui|) = 256, and similarly max(|b_u|) = 1.0. Reasonably approximate solutions are also acceptable, as long as the algorithm is faster than the naive one.

Upvotes: 2

Views: 298

Answers (1)

kangshiyin

Reputation: 9771

MKL usually has a large per-call overhead, so its performance is poor for small matrices. Eigen, on the other hand, has fixed-size matrix optimizations that perform well on small matrices. You also need the right compile options to get maximum performance out of Eigen.

#include <iostream>
#include <Eigen/Eigen>
#include <omp.h>

inline void Critical(const float A[48][16], const float b[48], float x[16]) {
  // Zero the accumulator first (the question's version leaves this to the caller).
  for (int i = 0; i < 16; ++i) {
    x[i] = 0;
  }
  for (int u = 0; u < 48; ++u) {
    for (int i = 0; i < 16; ++i) {
      x[i] += A[u][i] * b[u];
    }
  }
}

int main() {
  float a[48][16] = { 0 };
  float b[48] = { 0 };
  float x[16] = { 0 };
  Eigen::Matrix<float, 48, 16> ma;
  Eigen::Matrix<float, 1, 48> mb;
  Eigen::Matrix<float, 1, 16> mx;

  // Fill the Eigen matrices with random data, then copy it into the plain
  // C arrays so both versions compute the same product.
  ma.setRandom();
  mb.setRandom();
  for (int i = 0; i < 48; ++i) {
    for (int j = 0; j < 16; ++j) {
      a[i][j] = ma(i, j);
    }
    b[i] = mb(i);
  }

  double t;
  int n = 10000000;

  // Time the naive loop.
  t = omp_get_wtime();
  for (int i = 0; i < n; ++i) {
    Critical(a, b, x);
  }
  t = omp_get_wtime() - t;
  std::cout << "for-loop time: " << t << std::endl;

  // Time Eigen's fixed-size product.
  t = omp_get_wtime();
  for (int i = 0; i < n; ++i) {
    mx = mb * ma;
  }
  t = omp_get_wtime() - t;
  std::cout << "eigen time: " << t << std::endl;

  // Map the plain C result into Eigen and compare the two results.
  Eigen::Map<Eigen::Matrix<float, 1, 16>> native_x(x);
  std::cout << "error: " << (mx - native_x).norm() << std::endl;

  return 0;
}

When compiling with g++ 5.2.1

$ g++ -fopenmp -O3 -DNDEBUG -I~/program/include/eigen3 -o test/gemv test/gemv.cpp && test/gemv
for-loop time: 2.53004
eigen time: 1.17458
error: 1.49636e-06

When compiling with icpc 16.0.2

$ icpc -fopenmp -fast -DNDEBUG -I~/program/include/eigen3 -o test/gemv test/gemv.cpp && test/gemv
for-loop time: 1.03432
eigen time: 1.01054
error: 1.40769e-06

icpc auto-vectorizes the for loop, so its performance is the same as Eigen's.
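For reference, the vectorized code that the compiler (or Eigen) emits for this kernel is roughly equivalent to the following hand-written AVX sketch (hypothetical code, assuming AVX support; each 16-float row maps exactly onto two 256-bit accumulators):

#include <immintrin.h>

inline void CriticalAVX(const float A[48][16], const float b[48], float x[16]) {
  __m256 acc0 = _mm256_setzero_ps();  // accumulates x[0..7]
  __m256 acc1 = _mm256_setzero_ps();  // accumulates x[8..15]
  for (int u = 0; u < 48; ++u) {
    __m256 bu = _mm256_broadcast_ss(&b[u]);  // splat b[u] into all 8 lanes
    acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(bu, _mm256_loadu_ps(&A[u][0])));
    acc1 = _mm256_add_ps(acc1, _mm256_mul_ps(bu, _mm256_loadu_ps(&A[u][8])));
  }
  _mm256_storeu_ps(&x[0], acc0);
  _mm256_storeu_ps(&x[8], acc1);
}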

Upvotes: 2
