Reputation: 1599
I'm trying to use Armadillo to do linear regression, as in the following function:
void compute_weights()
{
    printf("transpose\n");
    const mat &xt(X.t());
    printf("inverse\n");
    mat xd;
    printf("mul\n");
    xd = (xt * X);
    printf("inv\n");
    xd = xd.i();
    printf("mul2\n");
    xd = xd * xt;
    printf("mul3\n");
    W = xd * Y;
}
I've split the computation up so I could see where the program gets so huge. The matrix X has 64 columns and over 23 million rows. The transpose isn't too bad, but that first multiply causes the memory footprint to completely blow up. As I understand it, each element of X.t() * X is the dot product of two columns of X, so the result should be a small 64x64 matrix.
Sure, it should take a long time, but why would the memory suddenly blow up to nearly 30 gigabytes?
Then it seems to hang on to that memory, and then when it gets to the second multiply, it's just too much, and the OS kills it for getting so huge.
Is there a way to compute products without so much memory usage? Can that memory be reclaimed? Is there a better way to represent these calculations?
Upvotes: 0
Views: 829
Reputation: 4431
You can compute the weights using far less memory with the QR decomposition (you might want to look up "least squares QR").
Briefly: use Householder transformations to (implicitly) find an orthogonal Q such that

Q' * X = R, where R is upper triangular,

and at the same time transform Y:

Q' * Y = y

Then solve

R * W = y

for W, using only the top 64 rows of R and y.
If you are willing to overwrite X and Y, then this requires no extra memory; otherwise you will need a copy of X and a copy of Y.
Upvotes: 1
Reputation: 26276
You don't stand a chance doing this whole multiplication in one shot unless you use a huge workstation. Like hbrerkere said, your initial memory consumption is about 22 GB, so either be ready for that or find another way.
If you don't have such a workstation, another way is to do the multiplication yourself and parallelize it. Here's how you do it:

- Load the matrix in parts that fit in memory.
- Use std::transform with the binary operator std::multiplies to multiply the parts you loaded (this will utilize your processor's vectorization and make it fast), and fill in the partial result you calculated.

This won't be as efficient, but it will work. Another option is to use Armadillo after decomposing your matrix into smaller matrices, whose multiplications will yield sub-results.
Both methods are much slower than doing the full multiplication in one shot.
Good luck!
Upvotes: 1