How to do a proper Cache Blocked Matrix Transposition?

Question

I am trying to do a Cache Blocked Matrix Transposition in C but I am having some troubles with something in my code. My guess is that it has to do with the indexes. Can you tell me where am I going wrong?

I am considering this both algorithm I found on the web: http://users.cecs.anu.edu.au/~Alistair.Rendell/papers/coa.pdf and http://iosrjen.org/Papers/vol3_issue11%20(part-4)/I031145055.pdf

But I couldn't figure it out yet how to correctly code those.

for (i = 0; i < N; i += block) {
    for (j = 0; j < i; j += block ) {

        for(ii=i;ii



temp1 and temp2 are two matrices of size block x block filled with zeros.
I am not sure if I have to do another for when I am returning the values to A2 (the before and after transposed matrix).

I also tried this:

for (i = 0; i < N; i += block) {
    for (j = 0; j < N; j += block ) {
    ii = A2[i][j];
    jj = A2[j][i];

        A2[j][i] = ii;
    A2[i][j] = jj;
    }                       
}


I am expecting better performance than the "naive" Matrix Transposition algorithm:

for (i = 1; i < N; i++) {
    for(j = 0; j < i; j++) {

        TEMP= A[i][j];
        A[i][j]=A[j][i];
        A[j][i]=TEMP;

    }
}

Alain Merigot · Accepted Answer

The proper way to do blocked matrix transposition is not what is in your program. The extra temp1 and temp2 array will uselessly fill you cache. And your second version is incorrect. More you do too much operations: elements are transposed twice and diagonal elements are "tranposed".

But first we can do some simple (and approximate) cache behavior analysis. I assume that you have a matrix of doubles and that cache lines are 64 bytes (8 doubles).

A blocked implementation is equivalent to a naive implementation if the cache can completely contain the matrix. You only have mandatory cache misses to fetch the matrix elements. The number of cache misses will be N×N/8 to process N×N elements, with an average number of misses of 1/8 per element.

Now, for the naive implementation, look at the situation after you have processed 1 line in the cache. Assuming you cache is large enough, you will have in your cache :
* the complete line A[0][i]
* the first 8 elements of every other lines of the matrix A[i][0..7]

This means that, if you cache is large enough, you can process the 7 successive lines without any cache miss other than the one to fetch the lines. So if your matrix is N×N, if cache size is larger than ~2×N×8, you will have only 8×N/8(lines)+N(cols)=2N cache misses to process 8×N elements, and the average number of misses per element is 1/4. Numerically, if L1 cache size is 32k, this will happen if N<2k. And if L2 cache is 256k, data will remain in cache L2 if N<16k. I do not think the difference between data in L1 and data in L2 will be really visible, thanks to the very efficient prefetch in modern processors.

If you have a very large matrix, after the end of first line, the beginning of the second line has been ejected from cache. This will happen if a line of your matrix completely fills the cache. In this situation, the number of cache misses will be much more important. Every line will have N/8 (to get the line) + N (to get the first elements of columns) cache misses, and there is an average on (9×N/8)/N≈1 miss per element.

So you can gain with a blocked implementation, but only for large matrices.

Here is a correct implementation of matrix transpose. It avoids a dual processing of element A[l][m] (when i=l and j=m or i=m and j=l), do not transpose diagonal elements and uses registers for the transposition.

Naive version

for (i=0;i



And the blocked version (we assume the matrix size is a multiple of block size)

for (ii=0;ii

How to do a proper Cache Blocked Matrix Transposition?

Answers (2)

Related Questions