Reputation: 155
As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.
I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.
If b = A*x, where A is a matrix, x a vector of the unknown regression coefficients, and y is the output, I want to find x that minimizes
||b - Ax||^2
If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then
x[j] <-- A[j]^T * (b - A[-j] * x[-j])
I'm following along with these notes (page 9-10) but the derivation is simple calculus.
It's pointed out that instead of recomputing A[j]^T(b - A[-j] * x[-j]) all the time, a faster way to do it is with
x[j] <-- A[j]^T*r + x[j]
where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j] and rearranging terms.
My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.
Below is some Julia code for the two update rules:
function OLS_builtin(A,b)
x = A\b
return(x)
end
function OLS_coord_descent(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
for j = 1:P
x[j] = dot(A[:,j], b - A[:,1:P .!= j]*x[1:P .!= j])
end
end
return(x)
end
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
end
end
return(x)
end
I generate data with the following:
n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(Float64,p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Here I use p=50, and I get good agreement between OLS_coord_descent(X,y)
and OLS_builtin(X,y)
, whereas OLS_coord_descent_fast(X,y)
returns exponentially large values for the regression coefficients.
When p is less than about 20, OLS_coord_descent_fast(X,y)
agrees with the other two.
Since things agrees for the regime of p << n, I think the algorithm is formally correct, but numerically unstable. Does anyone have any thoughts on whether this guess is correct, and if so how to correct for the instability while retaining (most) of the performance gains of the fast version of the algorithm?
Upvotes: 3
Views: 1291
Reputation: 18227
The quick answer: You forgot to update r
after each x[j]
update. Following is the fixed function which behaves like OLS_coord_descent
:
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
r -= A[:,j]*dot(A[:,j],r) # Add this line
end
end
return(x)
end
Upvotes: 5