Reputation: 2628
For simplicity lets assume that I have a vector of N
matrices each of M
rows. I am using STL std::accumulate
to compute the sum of all the matrices. I pass a binary functor that accepts two matrices (by reference) and returns their sum (by reference). Full disclosure: I am using libstdc++ parallel mode. Inside the functor I loop over the rows individually to compute the sum.
Though each matrix is too large to fit in the cache, a row fits in very nicely. So it would be advantageous to re-order the loops so that the outer loop indexes over the M
rows and the inner one over the N
matrices. In addition to defining the functor inline, is there anything else I can do to encourage such a cross-function-boundary loop re-ordering. I can of course restructure the code, but I would ideally like to keep the simple structure that the use of STL algorithms afford. If there is something that is gcc specific I wouldnt mind that either.
I am not actually dealing with matrices, that was just an example, but the same problem structure applies. The main issue is that of performance. Explaining the actual scenario would be too cumbersome, but the core problem is this: STL's accumulate entails an ordering among the nested loops that isnt very cache friendly because it tries to complete the addition of two objects before moving on to the next object. A single object is too big to be held in cache, but parts of it can be. So the execution can be sped up if one computes the 'additions' one 'part' at a time (over all objects). Hand reordering the loops leads to substantial improvement in FLOPS. But I would ideally like the compiler to do the re-ordering so that I can code at the STL level (as far as possible). So I am looking for tricks to do this.
Upvotes: 1
Views: 272
Reputation: 23916
class Matrix;
class Row;
struct SumNRow {
int _rowidx;
// Row _tempRow; //For return by reference left out for simplicity
SumNRow(int iRowIdx): _rowIdx(iRowIdx) {}
Row operator(const Matrix & iMarix1, const Matrix iMatrix2) {
return iMarix1[_rowIdx] + iMatrix2[_rowIdx];
}
};
template<class MatrixIterator>
void sum(const MatrixIterator & iMarixStart, const MatrixIterator & iMatrixEnd, Matrix & oMarix) {
for (int i = 0; i < iMarixStart->rowCount(); ++i) {
oMarix[i]=std::accumulate(iMarixStart, iMatrixEnd, SumNRow(i));
}
}
Upvotes: 1
Reputation: 58677
Write a new algorithm, or wrap things in a for-loop or a std::for_each()
call. It's going to be much easier than finding ways to adapt std::accumulate()
. I think that the only other alternative here is to introduce a new abstraction level to the library, that is beyond iterators. It's easier to just write a new algorithm or introduce an extra loop.
Upvotes: 1
Reputation: 89172
I can't imagine the compiler figuring this out unless everything was inlined and M and N were constant. Even then, it would be a stretch.
To keep STL algorithmic style, use a foreach M over the accumulate and have the functor just sum a row.
Upvotes: 1