Reputation: 399
I am trying to fit a linear regression Ax = b where A is a sparse matrix and b a sparse vector. I tried scipy.sparse.linalg.lsqr, but apparently b needs to be a NumPy (dense) array. Indeed, if I run

import scipy.sparse
import scipy.sparse.linalg

A = [list(range(0, 10)) for i in range(0, 15)]
A = scipy.sparse.coo_matrix(A)
b = list(range(0, 15))
b = scipy.sparse.coo_matrix(b)
scipy.sparse.linalg.lsqr(A, b)
I end up with:

AttributeError: squeeze not found

while

scipy.sparse.linalg.lsqr(A, b.toarray())

seems to work.
Unfortunately, in my case b is a 1.5 billion x 1 vector and I simply can't use a dense array. Does anybody know a workaround, or other libraries, for running linear regression with a sparse matrix and vector?
Upvotes: 2
Views: 3316
Reputation: 399
Apparently billions of observations are too much for my machine. I ended up:
Indeed, the update rule of SGD with a least-squares loss function is always zero for the observations in 2. This reduced the number of observations from billions to millions, which turned out to be feasible with SGD on my machine.
Upvotes: 0
Reputation: 7196
It seems that the documentation specifically asks for a NumPy array. However, given the scale of your problem, maybe it's easier to use the closed-form solution of linear least squares?
Given that you want to solve Ax = b, you can form the normal equations and solve those instead. In other words, you'd solve min ||Ax - b||_2.
The closed-form solution would be x = (A.T * A)^{-1} * A.T * b.
Of course, this closed-form solution comes with its own requirements (specifically, that A has full column rank).
You can solve for x using spsolve, or, if that's too expensive, using an iterative solver (like Conjugate Gradients) to get an inexact solution.
The code would be:
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

A = scipy.sparse.rand(1500, 1000, 0.5, format='csr')  # create a random instance
b = scipy.sparse.rand(1500, 1, 0.5)
# spsolve wants CSC on the left and a dense right-hand side here
x = scipy.sparse.linalg.spsolve((A.T @ A).tocsc(), A.T @ b.toarray().ravel())
x_lsqr = scipy.sparse.linalg.lsqr(A, b.toarray())  # just for comparison
print(np.linalg.norm(x_lsqr[0] - x))
which, on a few random instances, consistently gave me values less than 1e-7.
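For the iterative route, here is a sketch of Conjugate Gradients on the normal equations, using a LinearOperator so that A.T @ A never has to be formed explicitly (it can be much denser than A itself). The operator and sizes here are my own illustration, not from the answer above.

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

# Random instance standing in for the real (much larger) system
A = scipy.sparse.rand(1500, 1000, 0.5, format='csr', random_state=0)
b = np.random.default_rng(0).standard_normal(1500)

# A.T @ A is symmetric positive (semi-)definite, which is exactly what CG needs.
# The LinearOperator applies A.T @ (A @ v) on the fly instead of materializing A.T @ A.
normal_op = scipy.sparse.linalg.LinearOperator(
    shape=(A.shape[1], A.shape[1]),
    matvec=lambda v: A.T @ (A @ v),
    dtype=np.float64)

x, info = scipy.sparse.linalg.cg(normal_op, A.T @ b)  # info == 0 means CG converged
```

Each CG iteration only costs two sparse matrix-vector products, which is what makes this viable when A.T @ A would be too expensive to build or factor.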
Upvotes: 1