Reputation: 9321
I'm implementing a tangent distance based OCR solution for the iPhone, which heavily relies on fast multiplication of floating-point matrices of size 253x7. For the proof of concept, I've implemented my own naive matrix routines like this:
Matrix operator*(const Matrix& matrix) const {
    if(cols != matrix.rows) throw "cant multiply!";
    Matrix result(rows, matrix.cols);
    for(int i = 0; i < result.rows; i++){
        for(int j = 0; j < result.cols; j++){
            T tmp = 0;
            for(int k = 0; k < cols; k++){
                tmp += at(i,k) * matrix.at(k,j);
            }
            result.at(i,j) = tmp;
        }
    }
    return result;
}
As you can see, it's pretty basic. After the PoC performed well, I decided to push the performance limits further by incorporating the Accelerate Framework's matrix multiplication (which presumably uses SIMD and other fancy stuff to do the heavy lifting...):
Matrix operator*(const Matrix& m) const {
    if(cols != m.rows) throw "cant multiply!";
    Matrix result(rows, m.cols);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                rows, m.cols, cols,
                1, matrix, cols, m.matrix, m.cols,
                1, result.matrix, result.cols);
    return result;
}
Shockingly (at least for me), the above code took twice as long to multiply the matrices! I tried using single precision instead of double, because I suspected it was something related to the CPU's word size (32-bit float vs. 64-bit double on a 32-bit ARM), but there was no performance gain...
What am I doing wrong? Are my 253x7 matrices too small for a noticeable performance boost over the naive implementation?
Upvotes: 1
Views: 3185
Reputation: 106167
A couple of questions:
253 x 7 multiplied by what size matrix? If you’re doing, say, 253x7 * 7x1, then a general-purpose multiply routine is going to spend most of its time in edging code, and there’s very little that a tuned library can do that will make it faster than a naive implementation.
What hardware are you timing on, and what iOS version? Especially for double precision, older hardware and older iOS versions are more limited performance-wise. On a Cortex-A8, for example, double-precision arithmetic is completely unpipelined, so there’s almost nothing a library can do to beat a naive implementation.
If the other matrix isn’t ridiculously small, and the hardware is recent, please file a bug (unexpectedly low performance is absolutely a bug). Small matrices with high aspect ratio are quite difficult to make fast in a general-purpose matrix-multiply, but it’s still a good bug to file.
If the hardware/iOS version is old, you may want to use Accelerate anyway, as it should perform significantly better on newer hardware/software.
If the other matrix is just tiny, then there may not be much to be done. There is no double-precision SIMD on ARM, the matrices are too small to benefit from cache blocking, and the dimensions of the matrices would also be too small to benefit much from loop unrolling.
If you know a priori that your matrices will be exactly 253x7 * 7x???, you should be able to do much better than both a naive implementation and any general-purpose library by completely unrolling the inner dimension of the matrix multiplication.
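A minimal sketch of what that could look like, assuming row-major float storage and illustrative pointer/parameter names (a, b, c, n are not from the original code):

// Sketch: C (253 x n) = A (253 x 7) * B (7 x n), row-major storage assumed,
// with the inner dimension of 7 fully unrolled. Names are illustrative.
void multiply_253x7(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < 253; ++i) {
        const float* ai = a + i * 7;            // row i of A
        for (int j = 0; j < n; ++j) {
            // Unrolled dot product over the fixed inner dimension (7).
            c[i * n + j] = ai[0] * b[0 * n + j]
                         + ai[1] * b[1 * n + j]
                         + ai[2] * b[2 * n + j]
                         + ai[3] * b[3 * n + j]
                         + ai[4] * b[4 * n + j]
                         + ai[5] * b[5 * n + j]
                         + ai[6] * b[6 * n + j];
        }
    }
}

With the trip count known at compile time there is no loop overhead in the inner dimension, and the compiler is left with a fixed-size dot product it can schedule freely.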
Upvotes: 1
Reputation: 299325
Basically, yes. The "x7" portion is likely too small to make the overhead of CBLAS worth it. The cost of making a function call, plus all the flexibility that the CBLAS functions give you, takes a while to make back up. Every time you pass an option like CblasNoTrans, remember that there's an if() in there to manage that option. cblas_dgemm in particular accumulates into C, so it has to read the previous result element, apply a multiply, and then add before storing. That's a lot of extra work.
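For reference, BLAS gemm computes C = alpha*A*B + beta*C, and the call in the question passes beta = 1, which is what forces the read-modify-write of the result. A sketch of the same call with beta = 0 (reusing the question's member names) would be:

// Sketch: beta = 0 tells dgemm not to read or accumulate into the result;
// member names (matrix, rows, cols) follow the question's Matrix class.
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            rows, m.cols, cols,
            1.0, matrix, cols,
            m.matrix, m.cols,
            0.0, result.matrix, result.cols);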
You may want to try the vDSP functions rather than CBLAS. vDSP_mmul is a bit simpler and doesn't accumulate into the result. I've had good luck with vDSP_* on small data sets (a few thousand elements).
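A single-precision sketch of the question's operator using vDSP_mmul could look like the following; it assumes the Matrix element type is float and reuses the question's member names (matrix, rows, cols):

#include <Accelerate/Accelerate.h>

// Sketch: single-precision multiply via vDSP_mmul, which writes C = A * B
// without accumulating into C. Strides of 1 mean the data is contiguous.
Matrix operator*(const Matrix& m) const {
    if (cols != m.rows) throw "cant multiply!";
    Matrix result(rows, m.cols);
    vDSP_mmul(matrix, 1, m.matrix, 1, result.matrix, 1,
              rows, m.cols, cols);   // M = rows of A, N = cols of B, P = cols of A
    return result;
}

(For double precision there is a corresponding vDSP_mmulD.)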
That said, my experience with this is that naïve C implementations can often be quite fast on small data sets. Avoiding a function call is a huge benefit. Speaking of which, make sure that your at() call is inlined; otherwise you're wasting a lot of time in your loop. You can likely speed up the C implementation by using pointer additions to move serially through your matrices rather than the multiplies required for random access through []. On a matrix this small, it may or may not be worth it; you'd have to profile a bit. Looking at the assembler output is highly instructive.
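As an illustration, a pointer-walking version of the naive loop, assuming the elements sit in a flat row-major array named matrix as in the question's cblas call, might look like this:

// Sketch: step through A's rows and B's columns with pointer increments
// instead of recomputing row*cols + col on every access. Assumes flat,
// row-major storage in the 'matrix' member, as used by the cblas call.
Matrix operator*(const Matrix& m) const {
    if (cols != m.rows) throw "cant multiply!";
    Matrix result(rows, m.cols);
    T* out = result.matrix;
    for (int i = 0; i < rows; ++i) {
        const T* rowA = matrix + i * cols;       // start of row i of A
        for (int j = 0; j < m.cols; ++j) {
            const T* a = rowA;                   // walk along row i of A
            const T* b = m.matrix + j;           // walk down column j of B
            T tmp = 0;
            for (int k = 0; k < cols; ++k) {
                tmp += *a++ * *b;
                b += m.cols;                     // next element in column j
            }
            *out++ = tmp;                        // results are written sequentially
        }
    }
    return result;
}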
Keep in mind that you absolutely must profile this stuff on the device. Performance in the simulator is irrelevant. It's not just that the simulator is faster; it's completely different. Things that are wildly faster on the simulator can be much slower on device.
Upvotes: 0