Timothy Miller

Reputation: 1599

Armadillo: Matrix multiplications taking up huge amounts of memory

I'm trying to use armadillo to do linear regression as in the following function:

void compute_weights()
{
    printf("transpose\n");
    const mat &xt(X.t());
    printf("inverse\n");
    mat xd;
    printf("mul\n");
    xd = (xt * X);
    printf("inv\n");
    xd = xd.i();
    printf("mul2\n");
    xd = xd * xt;
    printf("mul3\n");
    W = xd * Y;
}

I've split this up so I could see what was going on with the program getting so huge. The matrix X has 64 columns and over 23 million rows. The transpose isn't too bad, but that first multiply causes the memory footprint to completely blow up. Now, as I understand it, if I multiply X.t() * X, each element of the matrix product will be the dot product of a column of X and a row of X.t(), and the result should be a 64x64 matrix.

Sure, it should take a long time, but why would the memory suddenly blow up to nearly 30 gigabytes?

Then it seems to hang on to that memory, and then when it gets to the second multiply, it's just too much, and the OS kills it for getting so huge.

Is there a way to compute products without so much memory usage? Can that memory be reclaimed? Is there a better way to represent these calculations?

Upvotes: 0

Views: 829

Answers (2)

dmuir

Reputation: 4431

You can compute the weights using far less memory via the QR decomposition (you might want to look up "least squares QR").

Briefly: use Householder transformations to (implicitly) find an orthogonal Q so that

Q'*X = R where R is upper triangular

and at the same time transform Y

Q'*Y = y

Solve

R*W = y for W, using only the top 64 rows of R and y

If you are willing to overwrite X and Y, then this requires no extra memory; otherwise you will need a copy of X and a copy of Y.
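The steps above can be sketched in plain C++ (a minimal, untuned Householder QR written for illustration; the function and variable names are mine, not from any library):

```cpp
#include <cmath>
#include <vector>

// Least squares via Householder QR, as outlined above: transform X and Y
// in place by orthogonal reflections, then back-substitute on the top
// n x n triangle. A is m x n row-major and b has length m; both are
// overwritten, which is what avoids the extra copies.
std::vector<double> qr_least_squares(std::vector<double>& A,
                                     std::vector<double>& b,
                                     int m, int n)
{
    for (int k = 0; k < n; ++k) {
        // Householder vector v that zeroes column k below the diagonal.
        double norm = 0.0;
        for (int i = k; i < m; ++i) norm += A[i*n + k] * A[i*n + k];
        norm = std::sqrt(norm);
        if (norm == 0.0) continue;
        const double alpha = (A[k*n + k] > 0.0) ? -norm : norm;
        std::vector<double> v(m, 0.0);
        for (int i = k; i < m; ++i) v[i] = A[i*n + k];
        v[k] -= alpha;
        double vtv = 0.0;
        for (int i = k; i < m; ++i) vtv += v[i] * v[i];
        if (vtv == 0.0) continue;
        // Apply H = I - 2*v*v'/(v'*v) to the remaining columns of A ...
        for (int j = k; j < n; ++j) {
            double dot = 0.0;
            for (int i = k; i < m; ++i) dot += v[i] * A[i*n + j];
            const double s = 2.0 * dot / vtv;
            for (int i = k; i < m; ++i) A[i*n + j] -= s * v[i];
        }
        // ... and to b, so b accumulates Q'*Y.
        double dot = 0.0;
        for (int i = k; i < m; ++i) dot += v[i] * b[i];
        const double s = 2.0 * dot / vtv;
        for (int i = k; i < m; ++i) b[i] -= s * v[i];
    }
    // Solve R*W = Q'*Y by back-substitution, using only the top n rows.
    std::vector<double> w(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i*n + j] * w[j];
        w[i] = s / A[i*n + i];
    }
    return w;
}
```

For the question's sizes, the peak extra memory here is one length-m Householder vector per step, instead of any second copy of X.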

Upvotes: 1

The Quantum Physicist

Reputation: 26276

You don't stand a chance doing this whole multiplication in one shot unless you use a huge workstation. As hbrerkere said, your initial memory consumption is about 22 GB, so you either need to be ready for that or find another way.
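As a back-of-envelope check of that figure (dimensions taken from the question, 8 bytes per double; the helper name is mine):

```cpp
#include <cstdio>

// Rough memory footprint in GiB for `copies` simultaneous in-memory
// copies of a rows x cols matrix of 8-byte doubles. X itself is one
// copy; a fully materialized X.t() is a second.
double footprint_gb(double rows, double cols, int copies)
{
    return copies * rows * cols * 8.0 / (1024.0 * 1024.0 * 1024.0);
}
```

One copy of X (23 million rows x 64 columns) is about 11 GiB, so X plus a fully formed transpose already sits near 22 GiB before the product is even computed.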

If you don't have such a workstation, another way is to do the multiplication yourself, and parallelize it. Here's how you do it:

  1. Don't load the whole matrix into memory; load parts of it.
  2. Load, say, a million rows of X and store them somewhere.
  3. Multiply that block by its own transpose and accumulate: X.t() * X is just the sum of Xi.t() * Xi over row blocks Xi, and each partial product is only 64x64 (a good compiler will vectorize these inner loops and keep them fast).
  4. Load the next part of your matrix and repeat.

This won't be as efficient, but it will work. Another option is to have Armadillo do the multiplications after you decompose your matrix into smaller blocks whose products yield the sub-results.
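The blocked accumulation described above can be sketched as follows (plain C++ rather than Armadillo, and with the whole matrix passed in only for illustration; in the real setting each block would be read from disk):

```cpp
#include <vector>

// Sketch of the blocked scheme: X.t() * X equals the sum of Xi.t() * Xi
// over row blocks Xi, so only one block of rows needs to be resident at
// a time. X is m x n row-major; the result is the n x n matrix X'X.
std::vector<double> xtx_blocked(const std::vector<double>& X,
                                int m, int n, int block)
{
    std::vector<double> acc(n * n, 0.0);      // running n x n total
    for (int r0 = 0; r0 < m; r0 += block) {
        const int r1 = (r0 + block < m) ? r0 + block : m;
        for (int i = r0; i < r1; ++i)         // acc += Xi.t() * Xi
            for (int a = 0; a < n; ++a)
                for (int c = 0; c < n; ++c)
                    acc[a*n + c] += X[i*n + a] * X[i*n + c];
    }
    return acc;
}
```

With n = 64 the accumulator is a fixed 32 KB regardless of how many rows X has; only the block size determines the working set.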

Both methods are slower than one full in-memory multiplication, for two reasons:

  1. The overhead of repeatedly loading and freeing data
  2. The loss of cache efficiency and of the heavily optimized kernels a single big multiply would use. Note that blocking doesn't change the asymptotic cost, though: the multiplication is O(N^3) whether or not you split it into blocks.

Good luck!

Upvotes: 1
