cadama
cadama

Reputation: 399

sparse least square regression

I am trying to fit a linear regression Ax = b where A is a sparse matrix and b a sparse vector. I tried scipy.sparse.linalg.lsqr but apparently b needs to be a numpy (dense) array. Indeed if i run

A = [list(range(0,10)) for i in range(0,15)]
A = scipy.sparse.coo_matrix(A)
b = list(range(0,15))
b = scipy.sparse.coo_matrix(b)
scipy.sparse.linalg.lsqr(A,b)

I end up with:

AttributeError: squeeze not found

While

scipy.sparse.linalg.lsqr(A,b.toarray())

seems to work.

Unfortunately, in my case b is a 1,5 billion x 1 vector and I simply can't use a dense array. Does anybody know a workaround or other libraries for running linear regression with sparse matrix and vector?

Upvotes: 2

Views: 3316

Answers (2)

cadama
cadama

Reputation: 399

Apparently billions of observations is too much for my machine. I ended up:

  1. Changing algorithm to Stochastic Gradient Descent (SGD): faster with many obs
  2. Removing completely sparse examples (i.e. features and label equal to zero)

Indeed, the update rule of SGD with least square loss function is always zero for obs in 2. This reduced observations from billions to millions which turned out to be feasible under SGD on my machine.

Upvotes: 0

Nitish
Nitish

Reputation: 7196

It seems that the documentation specifically asks for numpy array. However, given the scale of your problem, maybe its easier to use the closed-form solution of Linear Least Squares?

Given that you want to solve Ax = b, you can cast the normal equations and solve those instead. In other words, you'd solve min ||Ax-b||.

The closed form solution would be x = (A.T*A)^{-1} * A.T *b. Of course, this closed form solution comes with its own requirements (specifically, on the rank of the matrix A).

You can solve for x using spsolve or if that's too expensive, then using an iterative solver (like Conjugate Gradients) to get an inexact solution.

The code would be:

A = scipy.sparse.rand(1500,1000,0.5) #Create a random instance
b = scipy.sparse.rand(1500,1,0.5)
x = scipy.sparse.linalg.spsolve(A.T*A,A.T*b)
x_lsqr = scipy.sparse.linalg.lsqr(A,b.toarray()) #Just for comparison
print scipy.linalg.norm(x_lsqr[0]-x)

which on a few random instances, consistently gave me values less than 1E-7.

Upvotes: 1

Related Questions