user635832

Reputation: 485

A Cache Efficient Matrix Transpose Program?

So the obvious way to transpose a matrix is to use:

for( int i = 0; i < n; i++ )
  for( int j = 0; j < n; j++ )
    destination[j+i*n] = source[i+j*n];

but I want something that will take advantage of locality and cache blocking. I've been looking it up and can't find code that does this, but I'm told it should be a very simple modification of the original. Any ideas?

Edit: I have a 2000x2000 matrix, and I want to know how I can change the code using two for loops, basically splitting the matrix into blocks that I transpose individually, say 2x2 blocks, or 40x40 blocks, and see which block size is most efficient.

Edit2: The matrices are stored in column major order, that is to say for a matrix

a1 a2    
a3 a4

is stored as a1 a3 a2 a4.

Upvotes: 38

Views: 71628

Answers (8)

Etienne M

Reputation: 656

On my machine, this code reaches 45% of memcpy speed (single threaded, of course): https://godbolt.org/z/9Y3GzE1Tx

#include <cstddef>

template <typename T>
void transpose2(const T* in, T* out, size_t nrow, size_t ncol) {
    // tile sizes; nrow and ncol are assumed to be multiples of these
    static constexpr size_t kTileRowCount = 4;
    static constexpr size_t kTileColCount = 16;

    const auto transpose_block = [=](size_t col, size_t row) {
        T scratch[kTileRowCount * kTileColCount];

        // gather one tile from the (column-major) input
        for (size_t icol = 0; icol < kTileColCount; ++icol) {
            for (size_t irow = 0; irow < kTileRowCount; ++irow) {
                scratch[irow + icol * kTileRowCount] =
                    in[(row + irow) + (col + icol) * nrow];
            }
        }

        // scatter it transposed into the output
        for (size_t irow = 0; irow < kTileRowCount; ++irow) {
            for (size_t icol = 0; icol < kTileColCount; ++icol) {
                out[(col + icol) + (row + irow) * ncol] =
                    scratch[irow + icol * kTileRowCount];
            }
        }
    };

    for (size_t iblockcol = 0; iblockcol < ncol; iblockcol += kTileColCount) {
        for (size_t iblockrow = 0; iblockrow < nrow; iblockrow += kTileRowCount) {
            transpose_block(iblockcol, iblockrow);
        }
    }
}

Upvotes: 0

Mircea G

Reputation: 11

Here is sample C++ code for a cache-friendly, parallelized, in-place matrix transposition:

#include <algorithm>
#include <array>
#include <cassert>
#include <execution>
#include <ranges>
#include <utility>

constexpr auto range(unsigned max)
    { return std::views::iota(0u, max); }
constexpr auto range(unsigned start, unsigned max)
    { return std::views::iota(start, max); }

constexpr auto SZ = 1000u;
std::array<std::array<int, SZ>, SZ> mx;

void testtranspose(void)
{
    for (unsigned i = 0; i < SZ; i++)
        for (unsigned j = 0; j < SZ; j++)
            mx[i][j] = j;

    // ***************
    //       cache friendly in place parallel transposition
    constexpr unsigned block_size = 10; assert(SZ % block_size == 0);
    std::for_each(std::execution::par_unseq,
        range(SZ / block_size).begin(), range(SZ / block_size).end(),
        [&](const unsigned iib)
        {   // each task owns one block row, starting at the diagonal,
            // so the tasks touch disjoint elements and can run in parallel
            const unsigned ii = iib * block_size;
            // transpose the diagonal block in place
            for (unsigned i = ii; i < ii + block_size; i++)
                for (unsigned j = i + 1; j < ii + block_size; j++)
                    std::swap(mx[i][j], mx[j][i]);
            // swap the rest of the blocks with their mirrors below the diagonal
            for (unsigned jj = ii + block_size; jj < SZ; jj += block_size)
                for (unsigned i = ii; i < ii + block_size; i++)
                    for (unsigned j = jj; j < jj + block_size; j++)
                        std::swap(mx[i][j], mx[j][i]);
        });
    // ***************

    for (unsigned i = 0; i < SZ; i++)
        for (unsigned j = 0; j < SZ; j++)
            assert(mx[i][j] == i);
}

Upvotes: 0

Arnaud

Reputation: 3741

I had the exact same problem yesterday. I ended up with this solution:

#include <cstddef>

void transpose(double *dst, const double *src, size_t n, size_t p) noexcept {
    const size_t block = 32;
    // src is n x p and dst is p x n, both row-major: blocking the rows of
    // src means each inner loop writes `block` consecutive elements of dst
    // while reading down `block` rows of src
    for (size_t i = 0; i < n; i += block) {
        for (size_t j = 0; j < p; ++j) {
            for (size_t b = 0; b < block && i + b < n; ++b) {
                dst[j*n + i + b] = src[(i + b)*p + j];
            }
        }
    }
}

This is 4 times faster than the obvious solution on my machine.

This solution handles rectangular matrices whose dimensions are not a multiple of the block size.

If dst and src are the same square matrix, an in-place function should be used instead:

#include <cstddef>
#include <utility>

void transpose(double *m, size_t n) noexcept {
    size_t block = 0, size = 8;
    for (block = 0; block + size - 1 < n; block += size) {
        // transpose the block on the diagonal in place
        for (size_t i = block; i < block + size; ++i)
            for (size_t j = i + 1; j < block + size; ++j)
                std::swap(m[i*n + j], m[j*n + i]);
        // swap the column of blocks below the diagonal with the
        // row of blocks to its right
        for (size_t i = block + size; i < n; ++i)
            for (size_t j = block; j < block + size; ++j)
                std::swap(m[i*n + j], m[j*n + i]);
    }
    // handle the leftover bottom-right corner when size does not divide n
    for (size_t i = block; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            std::swap(m[i*n + j], m[j*n + i]);
}

I used C++11, but this could easily be translated into other languages.

Upvotes: 15

mariolpantunes

Reputation: 1132

Steve Jessop mentioned a cache-oblivious matrix transpose algorithm. For the record, I want to share a possible implementation of a cache-oblivious matrix transpose.

public class Matrix {
    protected double[] data;        // row-major storage
    protected int rows, columns;

    public Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
    }

    public Matrix transpose() {
        Matrix C = new Matrix(columns, rows);
        cachetranspose(0, rows, 0, columns, C);
        return C;
    }

    public void cachetranspose(int rb, int re, int cb, int ce, Matrix T) {
        int r = re - rb, c = ce - cb;
        if (r <= 16 && c <= 16) {
            // base case: the block is small enough to transpose directly
            for (int i = rb; i < re; i++) {
                for (int j = cb; j < ce; j++) {
                    T.data[j * rows + i] = data[i * columns + j];
                }
            }
        } else if (r >= c) {
            // recurse, splitting the longer dimension: rows
            cachetranspose(rb, rb + (r / 2), cb, ce, T);
            cachetranspose(rb + (r / 2), re, cb, ce, T);
        } else {
            // recurse, splitting the longer dimension: columns
            cachetranspose(rb, re, cb, cb + (c / 2), T);
            cachetranspose(rb, re, cb + (c / 2), ce, T);
        }
    }
}

More details on cache-oblivious algorithms can be found in the Wikipedia article Steve Jessop linked.

Upvotes: 7

Luther

Reputation: 1838

With a large matrix, possibly a large sparse matrix, it might be an idea to decompose it into smaller, cache-friendly chunks (say, 4x4 sub-matrices). You can also flag sub-matrices as the identity, which will help you create optimized code paths.
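
A minimal sketch of what that could look like (the Tile type, its identity flag, and transpose_tiled are hypothetical names for illustration, not an established API):

#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical tile type: a small dense block plus a flag marking it as
// the identity, so kernels can take a fast path (transpose(I) == I).
struct Tile {
    static constexpr std::size_t K = 4;
    std::array<double, K * K> v{};   // row-major within the tile
    bool is_identity = false;
};

// Transpose a square matrix stored as an nb x nb row-major grid of tiles:
// swap whole tiles across the diagonal, then transpose each tile locally.
void transpose_tiled(std::vector<Tile>& tiles, std::size_t nb) {
    auto transpose_one = [](Tile& t) {
        if (t.is_identity) return;   // optimized code path: nothing to do
        for (std::size_t i = 0; i < Tile::K; ++i)
            for (std::size_t j = i + 1; j < Tile::K; ++j)
                std::swap(t.v[i * Tile::K + j], t.v[j * Tile::K + i]);
    };
    for (std::size_t i = 0; i < nb; ++i) {
        transpose_one(tiles[i * nb + i]);              // diagonal tile
        for (std::size_t j = i + 1; j < nb; ++j) {
            std::swap(tiles[i * nb + j], tiles[j * nb + i]);
            transpose_one(tiles[i * nb + j]);
            transpose_one(tiles[j * nb + i]);
        }
    }
}

The same flag test can guard other kernels (multiplication, for instance) to skip identity tiles entirely.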

Upvotes: 0

payne

Reputation: 14177

Instead of transposing the matrix in memory, why not collapse the transposition operation into the next operation you're going to do on the matrix?
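
For example, if the next operation is a multiply, say C = A^T * B, the transpose can be folded into it by swapping A's indices, so A^T never has to exist in memory. A minimal sketch under the question's column-major layout (my example, not from this answer):

#include <cstddef>

// C = A^T * B without materializing A^T: all matrices are n x n,
// column-major, so element (r, c) lives at index r + c*n.
void multiply_At_B(const double* A, const double* B, double* C, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {       // column j of C
        for (std::size_t i = 0; i < n; ++i) {   // row i of C
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[k + i * n] * B[k + j * n];  // A^T(i,k) == A(k,i)
            C[i + j * n] = sum;
        }
    }
}

As a side effect, reading A's column i and B's column j is sequential in column-major storage, so the fused loop nest is itself reasonably cache-friendly.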

Upvotes: 9

aaz

Reputation: 5196

Matrix multiplication comes to mind, but the cache issue there is much more pronounced, because each element is read N times.

With matrix transpose, you are reading in a single linear pass and there's no way to optimize that. But you can simultaneously process several rows so that you write several columns and so fill complete cache lines. You will only need three loops.

Or do it the other way around and read in columns while writing linearly.
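
One way the three-loop version could look, assuming the question's column-major layout and n divisible by the strip width (a sketch; W is an illustrative tuning parameter):

#include <cstddef>

constexpr std::size_t W = 8;  // strip width: 8 doubles = one 64-byte cache line

// Process a strip of W source columns at a time: the inner loop issues W
// consecutive writes into a column of dst, filling whole cache lines,
// while each of the W source columns is read sequentially as i advances.
void transpose_strips(const double* src, double* dst, std::size_t n) {
    for (std::size_t j = 0; j < n; j += W)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < W; ++k)
                dst[(j + k) + i * n] = src[i + (j + k) * n];
}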

Upvotes: 3

Steve Jessop

Reputation: 279255

You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:

for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}

An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.

Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
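
For illustration, a sketch of that recursion (not code from this answer; column-major as in the question, with an arbitrary small cutoff, since a pure cache-oblivious version would recurse all the way down):

#include <cstddef>

// Transpose the sub-block rows [r0, r1) x columns [c0, c1) of an n x n
// column-major src into dst. Call as transpose_rec(src, dst, n, 0, n, 0, n).
void transpose_rec(const double* src, double* dst, std::size_t n,
                   std::size_t r0, std::size_t r1,
                   std::size_t c0, std::size_t c1) {
    if (r1 - r0 <= 16 && c1 - c0 <= 16) {
        // small enough: finish with a plain double loop
        for (std::size_t c = c0; c < c1; ++c)
            for (std::size_t r = r0; r < r1; ++r)
                dst[c + r * n] = src[r + c * n];
    } else if (r1 - r0 >= c1 - c0) {            // split the longer side
        std::size_t rm = r0 + (r1 - r0) / 2;
        transpose_rec(src, dst, n, r0, rm, c0, c1);
        transpose_rec(src, dst, n, rm, r1, c0, c1);
    } else {
        std::size_t cm = c0 + (c1 - c0) / 2;
        transpose_rec(src, dst, n, r0, r1, c0, cm);
        transpose_rec(src, dst, n, r0, r1, cm, c1);
    }
}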

Upvotes: 49
