Reputation: 21
Consider a matrix A of floats of size 48x16 and a vector of floats b of size 1x48.
Please suggest a way of calculating b×A as fast as possible on common desktop processors (i5/i7).
Background. The above product is in a tight loop, therefore its fast calculation is essential. At the moment I am working with the following naive algorithm:
inline void Critical(const float A[48][16], const float b[48], float x[16]) const {
    for (int u = 0; u < 48; ++u) {
        for (int i = 0; i < 16; ++i) {
            x[i] += A[u][i] * b[u];
        }
    }
}
I have tried to offload the multiplication to MKL's SGEMV and later to SGEMM, but to no avail. The naive implementation still runs faster on an i7-4800MQ.
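For context, the MKL variant was essentially a single SGEMV call. A minimal sketch of what such a call looks like (assuming row-major CBLAS, and expressing b×A as Aᵀb; the wrapper name CriticalMKL is just for illustration):

#include <mkl.h>

// Sketch: b (1x48) times A (48x16) equals A^T * b, i.e. SGEMV with transpose.
void CriticalMKL(const float A[48][16], const float b[48], float x[16]) {
    cblas_sgemv(CblasRowMajor, CblasTrans,
                48, 16,          // rows and columns of A as stored
                1.0f,            // alpha
                &A[0][0], 16,    // A and its leading dimension
                b, 1,            // input vector and stride
                0.0f,            // beta
                x, 1);           // output vector and stride
}

For a 48x16 problem the call overhead of the library dominates the arithmetic, which matches the timings below.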
EDIT1.
Eigen with static allocation is approximately as fast as the naive algorithm.
I've tried GCC 5, ICC, and VC2015U3 with optimizations turned on (-O3, fast math, -mtune=native, etc.). ICC seems to produce the fastest code on both Linux and Windows.
EDIT2.
The elements of A are small: max(|A_ui|) = 256, and similarly max(|b_u|) = 1.0. Reasonably approximate solutions are also acceptable, as long as the algorithm is faster than the naive one.
Upvotes: 2
Views: 298
Reputation: 9771
MKL usually has a large call overhead, so its performance is poor for small matrices. Eigen, on the other hand, has fixed-size matrix optimizations that perform well on small matrices. You also need the right compile options to get maximum performance out of Eigen.
#include <iostream>
#include <Eigen/Eigen>
#include <omp.h>

// Naive reference implementation: x = b * A, accumulating row by row.
inline void Critical(const float A[48][16], const float b[48], float x[16]) {
    for (int i = 0; i < 16; ++i) {
        x[i] = 0;
    }
    for (int u = 0; u < 48; ++u) {
        for (int i = 0; i < 16; ++i) {
            x[i] += A[u][i] * b[u];
        }
    }
}

int main() {
    float a[48][16] = { 0 };
    float b[48] = { 0 };
    float x[16] = { 0 };

    // Fixed-size Eigen matrices: dimensions known at compile time, no heap allocation.
    Eigen::Matrix<float, 48, 16> ma;
    Eigen::Matrix<float, 1, 48> mb;
    Eigen::Matrix<float, 1, 16> mx;
    ma.setRandom();
    mb.setRandom();

    // Copy the same random data into the plain C arrays.
    for (int i = 0; i < 48; ++i) {
        for (int j = 0; j < 16; ++j) {
            a[i][j] = ma(i, j);
        }
        b[i] = mb(i);
    }

    double t;
    int n = 10000000;

    // Time the naive loop.
    t = omp_get_wtime();
    for (int i = 0; i < n; ++i) {
        Critical(a, b, x);
    }
    t = omp_get_wtime() - t;
    std::cout << "for-loop time: " << t << std::endl;

    // Time the fixed-size Eigen product.
    t = omp_get_wtime();
    for (int i = 0; i < n; ++i) {
        mx = mb * ma;
    }
    t = omp_get_wtime() - t;
    std::cout << "eigen time: " << t << std::endl;

    // Compare the two results.
    Eigen::Map< Eigen::Matrix<float, 1, 16> > native_x(x);
    std::cout << "error: " << (mx - native_x).norm() << std::endl;
    return 0;
}
When compiling with g++ 5.2.1
$ g++ -fopenmp -O3 -DNDEBUG -I~/program/include/eigen3 -o test/gemv test/gemv.cpp && test/gemv
for-loop time: 2.53004
eigen time: 1.17458
error: 1.49636e-06
When compiling with icpc 16.0.2
$ icpc -fopenmp -fast -DNDEBUG -I~/program/include/eigen3 -o test/gemv test/gemv.cpp && test/gemv
for-loop time: 1.03432
eigen time: 1.01054
error: 1.40769e-06
icpc auto-vectorizes the for-loops, so its performance is the same as Eigen's.
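For the curious, here is roughly what such a vectorized kernel looks like written by hand with AVX2/FMA intrinsics (a sketch only, assuming the i7-4800MQ's FMA support; the function name CriticalAVX is mine and the compiler's actual output may differ):

#include <immintrin.h>

// The 16-wide inner loop maps onto two 8-float registers; b[u] is broadcast
// and fused-multiply-added into the accumulators once per row of A.
inline void CriticalAVX(const float A[48][16], const float b[48], float x[16]) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (int u = 0; u < 48; ++u) {
        const __m256 bu = _mm256_broadcast_ss(&b[u]);
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(&A[u][0]), bu, acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(&A[u][8]), bu, acc1);
    }
    _mm256_storeu_ps(&x[0], acc0);
    _mm256_storeu_ps(&x[8], acc1);
}

Whether you write this yourself or let Eigen/the compiler generate it, the whole result fits in two vector registers, which is why the fixed-size version is hard to beat.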
Upvotes: 2