Reputation: 267
Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.
def compute_mse(data, labels, weights):
    m = len(labels)
    hypothesis = np.dot(data, weights)
    sq_errors = (hypothesis - labels) ** 2
    mse = np.sum(sq_errors) / (2.0 * m)
    return mse
cur_mse = 1.0
prev_mse = 100.0
m = len(labels)

while cur_mse/prev_mse < 0.99999:
    prev_mse = cur_mse
    for i in range(m):
        d = np.array(data[i])
        hypothesis = np.dot(d, weights)
        gradient = np.dot((labels[i] - hypothesis), d)/m
        weights = weights + (alpha * gradient)

    cur_mse = compute_mse(data, labels, weights)
    if cur_mse > prev_mse:
        return
The weights are updated with respect to a single data point from the training set.
With an alpha of 0.001, the model is supposed to converge within a few iterations, yet I get no convergence. Is this convergence criterion too strict?
Upvotes: 0
Views: 3214
Reputation: 15909
I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:
input: f(x), alpha, initial x (guess or random)
output: min_x f(x)  # x that minimizes f(x)

while True:
    shuffle data  # good practice, not completely needed
    for d in data:
        x -= alpha * grad(f(x))  # df/dx
    if <stopping criterion>:
        break
There can be other regularization parameters added to the function that you want to minimize, such as an L1 penalty to avoid overfitting.
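For instance, a minimal sketch of what an L1-regularized objective could look like (my illustration, not part of the original answer; the names l1_objective and reg are made up):

import numpy as np

# Sketch only: squared-error objective plus an L1 penalty on the weights.
# reg is an illustrative regularization strength.
def l1_objective(A, x, b, reg=0.01):
    return np.sum((A.dot(x) - b) ** 2) + reg * np.sum(np.abs(x))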
Going back to your problem: looking at your data and the definition of your gradient, it looks like you want to solve a simple linear system of equations of the form:
Ax = b
which yields the objective function:
f(x) = ||Ax - b||^2
Stochastic gradient descent uses one row of data at a time:

||A_i x - b_i||^2

where || . || is the Euclidean norm and the subscript _i denotes the index of a row. Here, A is your data, x is your weights and b is your labels.
The gradient of the full objective is then computed as:

grad(f(x)) = 2 * A.T (Ax - b)

or, in the case of stochastic gradient descent, one row at a time:

2 * A_i.T (A_i x - b_i)

where .T means transpose.
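If you want to convince yourself of that gradient, here is a quick finite-difference check I'm adding as a sketch (it uses small throwaway data, independent of the setup below):

import numpy as np

# Numerically verify grad(f(x)) = 2 * A.T (Ax - b) for f(x) = ||Ax - b||^2.
A = np.random.randn(5, 2)
b = np.random.randn(5, 1)
x = np.random.randn(2, 1)

f = lambda v: np.sum((A.dot(v) - b) ** 2)
analytic = 2 * A.T.dot(A.dot(x) - b)

eps = 1e-6
numeric = np.zeros_like(x)
for j in range(x.shape[0]):
    e = np.zeros_like(x)
    e[j] = eps
    numeric[j] = (f(x + e) - f(x - e)) / (2 * eps)  # central difference

print(np.allclose(analytic, numeric, atol=1e-4))  # should print True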
Putting everything back into your code... first I will set up some synthetic data:
import numpy as np
from random import shuffle  # used below to shuffle the row indices

A = np.random.randn(100, 2)                       # 100x2 data
x = np.random.randn(2, 1)                         # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1)  # 100x1 labels
b[b == 0] = -1                                    # labels in {-1, 1}
Then, define the parameters:
alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = list(range(m))  # a list so it can be shuffled in place
And loop!
while cur_mse/prev_mse < 0.99999 and it < max_iter:
    prev_mse = cur_mse
    shuffle(idx)

    for i in idx:
        d = A[i:i+1]                   # one row of data, shape (1, 2)
        y = b[i:i+1]                   # matching label, shape (1, 1)
        h = np.dot(d, x)               # prediction for this row
        dx = 2 * np.dot(d.T, (h - y))  # per-row gradient
        x -= (alpha * dx)

    cur_mse = np.mean((A.dot(x) - b)**2)
    if cur_mse > prev_mse:
        raise Exception("Not converging")
    it += 1
This code is pretty much the same as yours, with a couple of additions:
- Another stopping criterion based on the number of iterations, to avoid looping forever if the system doesn't converge or converges too slowly.
- A redefinition of the gradient dx (still similar to yours). Your sign convention is inverted, so your weight update is positive (+), while in my example it is negative (-), which makes sense since you are moving down the gradient.
- Indexing of data and labels. While data[i] gives a 1-D array of shape (2,) (in this case, for 100x2 data), slicing with data[i:i+1] returns a view of the data without reshaping it (i.e. with shape (1, 2)) and therefore allows the proper matrix multiplications; see the short shape demo after this list.
- You can add a third stopping criterion based on an acceptable mse error, i.e. if cur_mse < 1e-3: break.
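To make the indexing point above concrete, here is a tiny shape demo (illustrative only):

import numpy as np

A = np.random.randn(100, 2)
print(A[0].shape)    # (2,)   -> 1-D array, awkward for matrix products
print(A[0:1].shape)  # (1, 2) -> 2-D row; np.dot(A[0:1], x) works directly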
With random data, this algorithm converges for me in 20-40 iterations (depending on the generated data).
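As a sanity check (my own addition, not part of the original code), the weights found by SGD should end up close to the closed-form least-squares solution:

# Compare the SGD result against numpy's least-squares solver.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print("SGD   :", x.ravel())
print("lstsq :", x_ls.ravel())
print("final MSE:", np.mean((A.dot(x) - b) ** 2))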
So... assuming that this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdetermined (you have fewer training samples than features, i.e. A is wider than it is tall).
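If you suspect that case, a quick check (again a sketch I'm adding, not part of the original answer) is to compare the shape and rank of A:

# Underdetermined if there are fewer rows than columns, or if A is rank-deficient.
n_rows, n_cols = A.shape
print("underdetermined:", n_rows < n_cols or np.linalg.matrix_rank(A) < n_cols)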
Hope it helps!
Upvotes: 2