How To Optimizing Memory Hits in Matrix Computations by Setting Contiguous Blocks by Row or by Column

Question

I am looking to optimize memory-hits by how I store the memory of a two-dimensional matrix. I was going to fold the 2D matrix into a 1D contiguous block and was wondering if it would make more sense to store the data as consecutive blocks by row or by column. The type of operations I am considering are the more expensive operations such as multiplication and SVD. Note that I am considering the implementation in C++.

Clarification on Configuration

By consecutive rows or columns I mean the following. Consider a 3x3 matrix

[a11 a12 a13]
[a21 a22 a23]
[a31 a32 a33]

Would it make more sense to store the matrix by row

[[a11 a12 a13] [a21 a22 a23] [a31 a32 a33]]

and then each element at [i,j] would be accessed as [i*nCol + j] and any element a[i,j] is closer in memory to a[i,j+1] than to a[i+1,j]

Or by column?

[[a11 a21 a31] [a12 a22 a32] [a13 a23 a33]]

and then each element at [i,j] would be accessed as [j*nRow + i] and any element a[i,j] is closer in memory to a[i+1,j] than a[i,j+1]

Now say we had a cache that only loaded blocks of three doubles at once. In the first case, to access a11, a12, and a13 would require loading one block. In the second case, to access a11, a13, and a13 would require loading three blocks. This may not be an issue for a 3x3 matrix where both cases need to load three blocks to complete the computation, and all three could easily fit inside our cache at a time, but this might become an issue when we have very large matrices where you cannot fit the entire matrix in your cache at one time.

Intuitive Response

I have done some research on storing a 2D matrix as a 1D array such as:

1d-or-2d-array-whats-faster

And also about the operators involved in matrix multiplication such as

Wikipedia article on Matrix Multiplication

and the associated Strassen Algorithm.

It seems that due to the nature of matrix multiplication, you traverse one matrix by row and the other by column. Intuitively, I would think that what performance you gain from storing the data in one configuration over the other is lost in this specific operation.

ie. Consider multiplying a two 2x2 matrices C = AB, where A is NxM and B is MxL

c[i,k] = sum(a[i,m] * b[m,k]) for m = [1...M]

You are accessing the data in rows in the left matrix and in columns in the right matrix, so you have no advantage of storing the data closer together as what is better for one matrix is worse for the other.

Considering the computationally expensive operations on matrices, would one of these configurations would be better in terms of memory access? Or is this a non-issue considering that practical large-scale matrix multiplication is done on GPUs or a similar configuration? Or is the cost of loading blocks of memory over-shadowed by something else?

How To Optimizing Memory Hits in Matrix Computations by Setting Contiguous Blocks by Row or by Column

Answers (1)

Related Questions