indecisivecoder
indecisivecoder

Reputation: 267

Stochastic Gradient Descent Convergence Criteria

Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.

def compute_mse(data, labels, weights):
    m = len(labels)
    hypothesis = np.dot(data,weights)
    sq_errors = (hypothesis - labels) ** 2
    mse = np.sum(sq_errors)/(2.0*m)
    return mse

cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
    prev_mse = cur_mse

    for i in range(m):
        d = np.array(data[i])
        hypothesis = np.dot(d, weights)
        gradient = np.dot((labels[i] - hypothesis), d)/m
        weights = weights + (alpha * gradient)

    cur_mse = compute_mse(data, labels, weights)
    if cur_mse > prev_mse:
        return 

The weights are update w.r.t. to a single data point in the training set.

With an alpha of 0.001, the model is supposed to have converged within a few iterations however I get no convergence. Is this convergence criteria too strict?

Upvotes: 0

Views: 3214

Answers (1)

Imanol Luengo
Imanol Luengo

Reputation: 15909

I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:

input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)

while True:
    shuffle data # good practice, not completely needed
    for d in data:
        x -= alpha * grad(f(x)) # df/dx
    if <stopping criterion>:
        break

There can be other regularization parameters added to the function that you want to minimize, such as the l1 penalty to avoid overfitting.

Going back to your problem, looking at your data and definition of the gradient, looks like you want to solve a simple linear system of equations of the form:

Ax = b

which yields the objevtive function:

f(x) = ||Ax - b||^2

stochastic gradient descent uses one row data at a time:

||A_i x - b||

where || o || is the euclidean norm and _i means index of a row.

Here, A is your data, x is your weights and b is your labels.

The gradient of the function is then computed as a:

grad(f(x)) = 2 * A.T (Ax - b)

Or in the case of the stochastic gradient descent:

2 * A_i.T (A_i x - b)

where .T means transpose.

Putting everything back into your code... first I will setup a synthetic data:

A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}

Then, define the parameters:

alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = range(m)

And loop!

while cur_mse/prev_mse < 0.99999 and it < max_iter:
    prev_mse = cur_mse
    shuffle(idx)

    for i in idx:
        d = A[i:i+1]
        y = b[i:i+1]
        h = np.dot(d, x)
        dx = 2 * np.dot(d.T, (h - y))
        x -= (alpha * dx)

    cur_mse = np.mean((A.dot(x) - b)**2)
    if cur_mse > prev_mse:
        raise Exception("Not converging")
    it += 1

This code is pretty much the same as yours, with a couple of additions:

  • Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge or does too slowly)

  • Redefinition of the gradient dx (still similar to yours). You have the sign inverted and therefore the weight update is positive + since in my example is negative - (makes sense since you are going down in a gradient).

  • Indexing of data and labels. While data[i] gives a tuple of size (2,) (in this case for a 100x2 data), using fancy indexing data[i:i+1] will return a view of the data without reshaping it (e.g with shape (1, 2)) and therefore will allow you to perform the proper matrix multiplications.

You can add a 3rd stopping criterion based on acceptable mse error, i.e: if cur_mse < 1e-3: break.

This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).

So... assuming that this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdeterminated (you have less training data than features, which means A is more wide than high).

Hope it helps!

Upvotes: 2

Related Questions