Vincent

Reputation: 1371

c++ speed comparison iterator vs index

I am currently writing a linear algebra library in C++, for educational purposes and personal use. As part of it I implemented a custom matrix class with custom row and column iterators. While these work very nicely with the std::algorithm and std::numeric functions, I ran a speed comparison of matrix multiplication between an index-based approach and an iterator/std::inner_product approach. The results differ significantly:

// used later on for the custom iterator
template<class U>
struct EveryNth {
    bool operator()(const U& ) { return m_count++ % N == 0; }
    EveryNth(std::size_t i) : m_count(0), N(i) {}
    EveryNth(const EveryNth& element) : m_count(0), N(element.N) {}

private:
    int m_count;
    std::size_t N;
};

template<class T, 
         std::size_t rowsize, 
         std::size_t colsize>  
class Matrix
{

private:

    // Data is stored in a MVector, a modified std::vector
    MVector<T> matrix;

    std::size_t row_dim;                  
    std::size_t column_dim;

public:

    // other constructors, this one is for matrix in the computation
    explicit Matrix(MVector<T>&& s): matrix(s), 
                                     row_dim(rowsize), 
                                     column_dim(colsize){
    }    

    // other code...

    typedef boost::filter_iterator<EveryNth<T>, 
                                   typename std::vector<T>::iterator> FilterIter;

    // returns an iterator that skips elements in a range
    // if "to" is to be specified, then from has to be set to a value
    // @ param "j" - j'th column to be requested
    // @ param "from" - starts at the from'th element
    // @ param "to" - goes from the from'th element to the "to'th" element
    FilterIter  begin_col( std::size_t j,
                           std::size_t from = 0, 
                           std::size_t to = rowsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( cols() ), 
            matrix.Begin() + index( from, j ), 
            matrix.Begin() + index( to, j )
            );
    }

    // specifies the end of the iterator
    // so that the iterator cannot "jump" past the last element into undefined behaviour
    FilterIter end_col( std::size_t j, 
                        std::size_t to = rowsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( cols() ), 
            matrix.Begin() + index( to, j ), 
            matrix.Begin() + index( to, j )
            );
    }

    FilterIter  begin_row( std::size_t i,
                           std::size_t from = 0,
                           std::size_t to = colsize ){
         return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( 1 ), 
            matrix.Begin() + index( i, from ), 
            matrix.Begin() + index( i, to )
            );
    }

    FilterIter  end_row( std::size_t i,
                         std::size_t to = colsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
            EveryNth<T>( 1 ), 
            matrix.Begin() + index( i, to ), 
            matrix.Begin() + index( i, to )
            );
    }

    // other code...

    // gives access to an element of the matrix by an index expressed
    // in terms of rows and columns
    // @ param "r" - r'th row of the matrix
    // @ param "c" - c'th column of the matrix
    std::size_t index(std::size_t r, std::size_t c) const {
        return r*cols()+c; 
    }

    // brackets operator
    // returns an element stored in the matrix
    // @ param "r" - r'th row in the matrix
    // @ param "c" - c'th column in the matrix
    T& operator()(std::size_t r, std::size_t c) { 
        assert(r < rows() && c < matrix.size() / rows());
        return matrix[index(r,c)]; 
    }

    const T& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows() && c < matrix.size() / rows()); 
        return matrix[index(r,c)]; 
    }

    // other code...

    // end of class
};

Now in the main function I run the following:

int main(int argc, char *argv[]){


    Matrix<int, 100, 100> a = Matrix<int, 100, 100>(range<int>(10000));


    std::clock_t begin = clock();
    double b = 0;
    for(std::size_t i = 0; i < a.rows(); i++){
        for (std::size_t j = 0; j < a.cols(); j++) {
                b += std::inner_product(a.begin_row(i), a.end_row(i), 
                                        a.begin_col(j), 0);        
        }
    }

    // double b = 0;
    // for(std::size_t i = 0; i < a.rows(); i++){
    //     for (std::size_t j = 0; j < a.cols(); j++) {
    //         for (std::size_t k = 0; k < a.rows(); k++) {
    //             b += a(i,k)*a(k,j);
    //         }
    //     }
    // }


    std::clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << elapsed_secs << std::endl;

    std::cout << "--- End of test ---" << std::endl;

    std::cout << std::endl;
    return 0;
}

For the std::inner_product/iterator approach it takes:

bash-3.2$ ./main

3.78358
--- End of test ---

and for the index (commented-out) approach:

bash-3.2$ ./main

0.106173
--- End of test ---

which is almost 40 times faster than the iterator approach. Do you see anything in the code that could slow down the iterator computation that much? I should mention that I tried out both methods and they produce correct results.

Thank you for your ideas.

Upvotes: 1

Views: 2678

Answers (2)

rwong

Reputation: 6162

This is just some additional information and general advice for low-level code optimization.


To conclusively find out where time is spent in low-level code (tight loops and hotspots),

  1. You must be able to implement multiple versions of the code for computing the same result, using different implementation strategies.
    • You will need broad mathematical and computational knowledge to do this.
  2. You must inspect the disassembly (machine code).
  3. You must also run your code under an instruction-level sampling profiler to see which part of the machine code is executed the most heavily (i.e. the hotspots).
    • In order to collect a sufficient number of profiler samples, you will need to run the code in a tight loop, millions or billions of times.
  4. You must compare the disassembly of the hotspots between different versions of code (from different implementation strategies).
  5. Based on the information above, you can arrive at the conclusion that some implementation strategies are less efficient (more wasteful or redundant) than others.
    • If you arrive at this step, you can now publish and share your findings with others.

Some possibilities:

  1. Using boost::filter_iterator to implement an iterator that visits only every Nth element is wasteful. The internal implementation must increment by one element at a time, testing the predicate at each step. If N is large, advancing to the next element via boost::filter_iterator becomes an O(N) operation, as opposed to plain iterator arithmetic, which would be an O(1) operation.
  2. Your boost::filter_iterator implementation uses the modulo operator. Although integer division and modulo operations are fast on modern CPUs, it is still not as fast as simple integer arithmetic.
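To illustrate the O(1) advancement point, here is a minimal hand-rolled strided iterator sketch (the names `strided_iterator` and `row_col_dot` are illustrative, not part of the OP's class) that jumps a whole row stride per increment instead of filtering element by element. It assumes a square, row-major matrix:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Minimal strided-iterator sketch: operator++ jumps `stride` elements
// in O(1), whereas boost::filter_iterator must step one element at a
// time and test its predicate, making each advance O(N) for a column walk.
template <class T>
struct strided_iterator {
    using iterator_category = std::forward_iterator_tag;
    using value_type = T;
    using difference_type = std::ptrdiff_t;
    using pointer = const T*;
    using reference = const T&;

    const T* p;
    std::size_t stride;

    reference operator*() const { return *p; }
    strided_iterator& operator++() { p += stride; return *this; }
    bool operator==(const strided_iterator& o) const { return p == o.p; }
    bool operator!=(const strided_iterator& o) const { return p != o.p; }
};

// Dot product of row i with column j of a square, row-major matrix m
// with `cols` columns: the column walk advances by `cols` per step.
template <class T>
T row_col_dot(const std::vector<T>& m, std::size_t cols,
              std::size_t i, std::size_t j) {
    strided_iterator<T> first{m.data() + j, cols};            // top of column j
    strided_iterator<T> last{m.data() + m.size() + j, cols};  // one past the end
    const T* row = m.data() + i * cols;                       // row i, stride 1
    T sum = T(0);
    for (; first != last; ++first, ++row) sum += *first * *row;
    return sum;
}
```

For the 2x2 matrix {1,2,3,4}, `row_col_dot(m, 2, 0, 0)` computes row 0 times column 0, i.e. 1*1 + 2*3 = 7, with one pointer addition per element of the column.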

Simply speaking,

  • Increments, decrements, additions and subtractions are fastest, for both integers and floating point.
  • Multiplication and bit shifts are slightly slower.
  • Divisions and modulo operations are slower yet.
  • Finally, floating point trigonometric and transcendental functions, especially those that require calling out to standard mathematical library functions, will be the slowest.

Upvotes: 2

John R. Strohm

Reputation: 7667

What you have to understand is that matrix operations are VERY well-understood, and compilers are VERY good at doing optimizations of the things that are involved in matrix operations.

Consider C = AB, where C is MxN, A is MxQ, B is QxN.

double a[M][Q], b[Q][N], c[M][N];
for(unsigned i = 0; i < M; i++){
  for (unsigned j = 0; j < N; j++) {
    double temp = 0.0;
    for (unsigned k = 0; k < Q; k++) {
      temp += a[i][k]*b[k][j];
    }
    c[i][j] = temp;
  }
}

(You would not believe how tempted I was to write the above in FORTRAN IV.)

The compiler looks at this and notices that what is really happening is that the code walks through a and c with a stride of 1 and through b with a stride of N. It eliminates the multiplications in the subscript calculations and does straight indexing.

At that point, the inner loop is of the form:

temp += a[r1] * b[r2];
r1 += 1;
r2 += N;

And you have loops around that to (re)initialize r1 and r2 for each pass.

That is the absolute minimum computation you can do to do a straightforward matrix multiplication. You cannot do any less than that, because you have to do those multiplications and additions and index adjustments.

All you can do is add overhead.

That's what the iterator and std::inner_product() approach does: it adds metric tonnes of overhead.
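The strength-reduced form described above can be written out by hand. This is a sketch of the shape an optimizer arrives at, not actual compiler output, and `multiply` is an illustrative name; the subscript multiplications inside the inner loop are replaced by running offsets that are only incremented:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written strength-reduced multiply, C = A*B with A (MxQ),
// B (QxN), C (MxN), all row-major in flat vectors. The offsets r1 and r2
// replace the per-iteration index multiplications: r1 walks row i of A
// with stride 1, r2 walks column j of B with stride N.
std::vector<double> multiply(const std::vector<double>& a,
                             const std::vector<double>& b,
                             std::size_t M, std::size_t Q, std::size_t N) {
    std::vector<double> c(M * N, 0.0);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            double temp = 0.0;
            std::size_t r1 = i * Q;   // start of row i in a
            std::size_t r2 = j;       // top of column j in b
            for (std::size_t k = 0; k < Q; ++k) {
                temp += a[r1] * b[r2];
                r1 += 1;
                r2 += N;
            }
            c[i * N + j] = temp;
        }
    }
    return c;
}
```

Multiplying [[1,2],[3,4]] by [[5,6],[7,8]] this way gives [[19,22],[43,50]], the same result as the triple subscript loop, with nothing but additions in the inner loop's index bookkeeping.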

Upvotes: 2
