Reputation: 9321
I'm implementing a tangent distance based OCR solution for the iPhone, which heavily relies on fast multiplication of floating-point matrices of size 253x7. For the proof of concept, I've implemented my own naive matrix routines like this:
Matrix operator*(const Matrix& matrix) const {
    if(cols != matrix.rows) throw "cant multiply!";
    Matrix result(rows, matrix.cols);
    for(int i = 0; i < result.rows; i++){
        for(int j = 0; j < result.cols; j++){
            T tmp = 0;
            for(int k = 0; k < cols; k++){
                tmp += at(i,k) * matrix.at(k,j);
            }
            result.at(i,j) = tmp;
        }
    }
    return result;
}
As you can see, it's pretty basic. After the PoC performed well, I decided to push the performance limits further by incorporating the Accelerate Framework's matrix multiplication (which presumably uses SIMD and other fancy stuff to do the heavy lifting...):
Matrix operator*(const Matrix& m) const {
    if(cols != m.rows) throw "cant multiply!";
    Matrix result(rows, m.cols);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                rows, m.cols, cols,
                1, matrix, cols, m.matrix, m.cols,
                1, result.matrix, result.cols);
    return result;
}
Shockingly (at least for me), the above code took twice as long to multiply the matrices! I tried using single precision instead of double, because I suspected it was something related to the CPU's word size (32-bit float vs. 64-bit double on a 32-bit ARM), but there was no performance gain...
What am I doing wrong? Are my 253x7 matrices too small for a noticeable performance boost over the naive implementation?
Upvotes: 1
Views: 3185
Reputation: 106167
A couple of questions:
253 x 7 multiplied by what size matrix? If you’re doing, say, 253x7 * 7x1, then a general-purpose multiply routine is going to spend most of its time in edging code, and there’s very little that a tuned library can do that will make it faster than a naive implementation.
What hardware are you timing on, and what iOS version? Especially for double precision, older hardware and older iOS versions are more limited performance-wise. On a Cortex-A8, for example, double-precision arithmetic is completely unpipelined, so there’s almost nothing a library can do to beat a naive implementation.
If the other matrix isn’t ridiculously small, and the hardware is recent, please file a bug (unexpectedly low performance is absolutely a bug). Small matrices with high aspect ratio are quite difficult to make fast in a general-purpose matrix-multiply, but it’s still a good bug to file.
If the hardware/iOS version is old, you may want to use Accelerate anyway, as it should perform significantly better on newer hardware/software.
If the other matrix is just tiny, then there may not be much to be done. There is no double-precision SIMD on ARM, the matrices are too small to benefit from cache blocking, and the dimensions of the matrices would also be too small to benefit much from loop unrolling.
If you know a priori that your matrices will be exactly 253x7 * 7x???, you should be able to do much better than both a naive implementation and any general-purpose library by completely unrolling the inner dimension of the matrix multiplication.
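A minimal sketch of what that could look like, assuming row-major float storage and illustrative pointer/parameter names (a, b, c, n are not from the original code):

// Sketch: C (253 x n) = A (253 x 7) * B (7 x n), row-major storage assumed,
// with the inner dimension of 7 fully unrolled. Names are illustrative.
void multiply_253x7(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < 253; ++i) {
        const float* ai = a + i * 7;            // row i of A
        for (int j = 0; j < n; ++j) {
            // Unrolled dot product over the fixed inner dimension (7).
            c[i * n + j] = ai[0] * b[0 * n + j]
                         + ai[1] * b[1 * n + j]
                         + ai[2] * b[2 * n + j]
                         + ai[3] * b[3 * n + j]
                         + ai[4] * b[4 * n + j]
                         + ai[5] * b[5 * n + j]
                         + ai[6] * b[6 * n + j];
        }
    }
}

With the trip count known at compile time there is no loop overhead in the inner dimension, and the compiler is left with a fixed-size dot product it can schedule freely.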
Upvotes: 1
Reputation: 299325
Basically, yes. The "x7" portion is likely too small to make the overhead of CBLAS worth it. The cost of making a function call, plus all the flexibility that the CBLAS functions give you, takes a while to make back up. Every time you pass an option like CblasNoTrans, remember that there's an if() in there to manage that option. cblas_dgemm in particular accumulates into C, so it has to read the previous result element, apply a multiply, and then add before storing. That's a lot of extra work.
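For reference, BLAS gemm computes C = alpha*A*B + beta*C, and the call in the question passes beta = 1, which is what forces the read-modify-write of the result. A sketch of the same call with beta = 0 (reusing the question's member names) would be:

// Sketch: beta = 0 tells dgemm not to read or accumulate into the result;
// member names (matrix, rows, cols) follow the question's Matrix class.
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            rows, m.cols, cols,
            1.0, matrix, cols,
            m.matrix, m.cols,
            0.0, result.matrix, result.cols);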
You may want to try the vDSP functions rather than CBLAS. vDSP_mmul is a bit simpler and doesn't accumulate into the result. I've had good luck with vDSP_* on small data sets (a few thousand elements).
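A single-precision sketch of the question's operator using vDSP_mmul could look like the following; it assumes the Matrix element type is float and reuses the question's member names (matrix, rows, cols):

#include <Accelerate/Accelerate.h>

// Sketch: single-precision multiply via vDSP_mmul, which writes C = A * B
// without accumulating into C. Strides of 1 mean the data is contiguous.
Matrix operator*(const Matrix& m) const {
    if (cols != m.rows) throw "cant multiply!";
    Matrix result(rows, m.cols);
    vDSP_mmul(matrix, 1, m.matrix, 1, result.matrix, 1,
              rows, m.cols, cols);   // M = rows of A, N = cols of B, P = cols of A
    return result;
}

(For double precision there is a corresponding vDSP_mmulD.)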
That said, my experience with this is that naïve C implementations can often be quite fast on small data sets. Avoiding a function call is a huge benefit. Speaking of which, make sure that your at() call is inlined; otherwise you're wasting a lot of time in your loop. You can likely speed up the C implementation by using pointer additions to move serially through your matrices rather than the multiplies required for random access through []. On a matrix this small, it may or may not be worth it; you'd have to profile a bit. Looking at the assembler output is highly instructive.
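As an illustration, a pointer-walking version of the naive loop, assuming the elements sit in a flat row-major array named matrix as in the question's cblas call, might look like this:

// Sketch: step through A's rows and B's columns with pointer increments
// instead of recomputing row*cols + col on every access. Assumes flat,
// row-major storage in the 'matrix' member, as used by the cblas call.
Matrix operator*(const Matrix& m) const {
    if (cols != m.rows) throw "cant multiply!";
    Matrix result(rows, m.cols);
    T* out = result.matrix;
    for (int i = 0; i < rows; ++i) {
        const T* rowA = matrix + i * cols;       // start of row i of A
        for (int j = 0; j < m.cols; ++j) {
            const T* a = rowA;                   // walk along row i of A
            const T* b = m.matrix + j;           // walk down column j of B
            T tmp = 0;
            for (int k = 0; k < cols; ++k) {
                tmp += *a++ * *b;
                b += m.cols;                     // next element in column j
            }
            *out++ = tmp;                        // results are written sequentially
        }
    }
    return result;
}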
Keep in mind that you absolutely must profile this stuff on the device. Performance in the simulator is irrelevant. It's not just that the simulator is faster; it's completely different. Things that are wildly faster on the simulator can be much slower on device.
Upvotes: 0