Reputation: 1371
I am currently writing a linear algebra library in C++, for educational purposes and personal use. As part of it I implemented a custom matrix class with custom row and column iterators. While these work very nicely with the std::algorithm and std::numeric functions, I ran a speed comparison of a matrix multiplication using an index-based approach versus an iterator/std::inner_product approach. The results differ significantly:
// used later on for the custom iterator
template<class U>
struct EveryNth {
    bool operator()(const U&) { return m_count++ % N == 0; }
    EveryNth(std::size_t i) : m_count(0), N(i) {}
    EveryNth(const EveryNth& element) : m_count(0), N(element.N) {}
private:
    int m_count;
    std::size_t N;
};
template<class T,
         std::size_t rowsize,
         std::size_t colsize>
class Matrix
{
private:
    // Data is stored in an MVector, a modified std::vector
    MVector<T> matrix;
    std::size_t row_dim;
    std::size_t column_dim;
public:
    // other constructors, this one is for the matrix in the computation
    explicit Matrix(MVector<T>&& s): matrix(s),
                                     row_dim(rowsize),
                                     column_dim(colsize){
    }
    // other code...
    typedef boost::filter_iterator<EveryNth<T>,
                                   typename std::vector<T>::iterator> FilterIter;
    // returns an iterator that skips elements in a range
    // if "to" is to be specified, then "from" has to be set to a value
    // @ param "j"    - j'th column to be requested
    // @ param "from" - starts at the from'th element
    // @ param "to"   - goes from the from'th element to the to'th element
    FilterIter begin_col( std::size_t j,
                          std::size_t from = 0,
                          std::size_t to = rowsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
                EveryNth<T>( cols() ),
                matrix.Begin() + index( from, j ),
                matrix.Begin() + index( to, j )
        );
    }
    // specifies the end of the iterator
    // so that the iterator cannot "jump" past the last element into undefined behaviour
    FilterIter end_col( std::size_t j,
                        std::size_t to = rowsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
                EveryNth<T>( cols() ),
                matrix.Begin() + index( to, j ),
                matrix.Begin() + index( to, j )
        );
    }
    FilterIter begin_row( std::size_t i,
                          std::size_t from = 0,
                          std::size_t to = colsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
                EveryNth<T>( 1 ),
                matrix.Begin() + index( i, from ),
                matrix.Begin() + index( i, to )
        );
    }
    FilterIter end_row( std::size_t i,
                        std::size_t to = colsize ){
        return boost::make_filter_iterator<EveryNth<T> >(
                EveryNth<T>( 1 ),
                matrix.Begin() + index( i, to ),
                matrix.Begin() + index( i, to )
        );
    }
    // other code...
    // allows access to an element of the matrix by an index expressed
    // in terms of rows and columns
    // @ param "r" - r'th row of the matrix
    // @ param "c" - c'th column of the matrix
    std::size_t index(std::size_t r, std::size_t c) const {
        return r*cols()+c;
    }
    // brackets operator
    // returns an element stored in the matrix
    // @ param "r" - r'th row in the matrix
    // @ param "c" - c'th column in the matrix
    T& operator()(std::size_t r, std::size_t c) {
        assert(r < rows() && c < matrix.size() / rows());
        return matrix[index(r,c)];
    }
    const T& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows() && c < matrix.size() / rows());
        return matrix[index(r,c)];
    }
    // other code...
    // end of class
};
Now in the main function I run the following:
int main(int argc, char *argv[]){
    Matrix<int, 100, 100> a = Matrix<int, 100, 100>(range<int>(10000));
    std::clock_t begin = clock();
    double b = 0;
    for(std::size_t i = 0; i < a.rows(); i++){
        for (std::size_t j = 0; j < a.cols(); j++) {
            std::inner_product(a.begin_row(i), a.end_row(i),
                               a.begin_col(j), 0);
        }
    }
    // double b = 0;
    // for(std::size_t i = 0; i < a.rows(); i++){
    //     for (std::size_t j = 0; j < a.cols(); j++) {
    //         for (std::size_t k = 0; k < a.rows(); k++) {
    //             b += a(i,k)*a(k,j);
    //         }
    //     }
    // }
    std::clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << elapsed_secs << std::endl;
    std::cout << "--- End of test ---" << std::endl;
    std::cout << std::endl;
    return 0;
}
For the std::inner_product/iterator approach it takes:
bash-3.2$ ./main
3.78358
--- End of test ---
and for the index (commented-out) approach:
bash-3.2$ ./main
0.106173
--- End of test ---
which is almost 40 times faster than the iterator approach. Do you see anything in the code that could slow down the iterator computation that much? I should mention that I tried out both methods and they produce correct results.
Thank you for your ideas.
Upvotes: 1
Views: 2678
Reputation: 6162
This is just some additional information and general advice for low-level code optimization.
To conclusively find out where time is spent in low-level code (tight loops and hotspots), profile it and look at the generated assembly rather than guessing.
Some possibilities:
- Using boost::filter_iterator to implement an iterator that skips every Nth element is wasteful. The internal implementation must increment by one element at a time, so if N is large, visiting the next element via boost::filter_iterator becomes an O(N) operation, as opposed to simple strided iterator arithmetic, which would be an O(1) operation.
- The predicate used with boost::filter_iterator here (EveryNth) relies on the modulo operator. Although integer division and modulo operations are fast on modern CPUs, they are still not as fast as simple integer arithmetic.
Simply speaking, the filtered-iterator approach does far more work per element visited than plain index arithmetic does.
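To contrast with that, here is a minimal sketch of a strided iterator whose increment is a single addition, so std::inner_product can walk a column in O(1) per element. This is not code from the question's library; the strided_iterator class and the 3x3 example are made-up illustrations assuming a flat, row-major std::vector like the matrix in the question.
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Minimal input iterator that visits every "stride"-th element of a flat
// container. Incrementing it only bumps an integer offset, so stepping to
// the next element of a column is O(1), not an O(N) scan.
template <class It>
class strided_iterator {
public:
    using value_type        = typename std::iterator_traits<It>::value_type;
    using reference         = typename std::iterator_traits<It>::reference;
    using pointer           = typename std::iterator_traits<It>::pointer;
    using difference_type   = typename std::iterator_traits<It>::difference_type;
    using iterator_category = std::input_iterator_tag;

    strided_iterator(It base, difference_type stride, difference_type pos = 0)
        : base_(base), stride_(stride), pos_(pos) {}

    reference operator*() const { return *(base_ + pos_); }
    strided_iterator& operator++() { pos_ += stride_; return *this; }   // O(1)
    strided_iterator operator++(int) { auto tmp = *this; ++*this; return tmp; }
    bool operator==(const strided_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const strided_iterator& o) const { return !(*this == o); }

private:
    It base_;                 // start of the flat storage
    difference_type stride_;  // number of columns of the matrix
    difference_type pos_;     // current flat index
};

// Usage: dot product of row 0 (contiguous) with column 1 (strided)
// of a 3x3 row-major matrix stored in a flat vector.
int example()
{
    std::vector<int> m = {1, 2, 3,
                          4, 5, 6,
                          7, 8, 9};
    const std::ptrdiff_t cols = 3;
    // column 1 starts at flat index 1 and advances by "cols" each step
    strided_iterator<std::vector<int>::iterator> col1(m.begin(), cols, 1);
    // 1*2 + 2*5 + 3*8
    return std::inner_product(m.begin(), m.begin() + cols, col1, 0);
}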
Upvotes: 2
Reputation: 7667
What you have to understand is that matrix operations are VERY well-understood, and compilers are VERY good at doing optimizations of the things that are involved in matrix operations.
Consider C = AB, where C is MxN, A is MxQ, B is QxN.
double a[M][Q], b[Q][N], c[M][N];
for(unsigned i = 0; i < M; i++){
    for (unsigned j = 0; j < N; j++) {
        double temp = 0.0;
        for (unsigned k = 0; k < Q; k++) {
            temp += a[i][k]*b[k][j];
        }
        c[i][j] = temp;
    }
}
(You would not believe how tempted I was to write the above in FORTRAN IV.)
The compiler looks at this, and notices that what is really happening is that it is walking through a and c with a stride of 1 and through b with a stride of N. It eliminates the multiplications in the subscript calculations and does straight indexing.
At that point, the inner loop is of the form:
temp += a[r1] * b[r2];
r1 += 1;
r2 += N;
And you have loops around that to (re)initialize r1 and r2 for each pass.
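To make that concrete, here is a hypothetical sketch of the loop nest after that strength reduction; it is not the answer's code or actual compiler output, just an illustration of the transformation it describes, with running flat indices r1 and r2 replacing the subscript multiplications:
// Sketch of the strength-reduced form: a is M x Q, b is Q x N, c is M x N,
// all row-major and stored flat. Each inner iteration is one multiply,
// one add into temp, and two index additions.
void multiply(const double* a, const double* b, double* c,
              unsigned M, unsigned Q, unsigned N)
{
    for (unsigned i = 0; i < M; i++) {
        for (unsigned j = 0; j < N; j++) {
            double temp = 0.0;
            unsigned r1 = i * Q;      // start of row i of a
            unsigned r2 = j;          // top of column j of b
            for (unsigned k = 0; k < Q; k++) {
                temp += a[r1] * b[r2];
                r1 += 1;              // next element in row i (stride 1)
                r2 += N;              // next element in column j (stride N)
            }
            c[i * N + j] = temp;
        }
    }
}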
That is the absolute minimum amount of computation for a straightforward matrix multiplication. You cannot do any less than that, because you have to do those multiplications, additions, and index adjustments.
All you can do is add overhead.
That's what the iterator and std::inner_product() approach does: it adds metric tonnes of overhead.
Upvotes: 2