Reputation: 119
Say I have a parameter matrix W, which I am learning using a gradient descent method.
If I have reason to believe that the columns of W should be roughly orthogonal to one another, is there a specific regularization that I can impose upon this matrix?
It seems to me that something like:
W^T W - diag(W^T W)
would penalize the off-diagonal elements of W^T W, which are the pairwise inner products of the columns of W and so measure how far those columns are from being orthogonal.
However, this isn't entirely differentiable to my knowledge. Any other methods I should be aware of?
Upvotes: 0
Views: 991
Reputation: 66805
Every part of sum([W'W - diag(W'W)]^2) is differentiable, so why would you think otherwise? Only additions and multiplications are involved, nothing else. (You do need the ^2, or an abs, to remove the sign; otherwise off-diagonal entries of opposite sign can cancel, e.g. column pairs with inner products +100 and -100 would contribute a total cost of 0 even though the columns are nowhere near orthogonal.)
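For concreteness, here is a minimal sketch of that penalty (assuming a PyTorch setup; the sizes d and n below are made up) showing that autograd differentiates it with no special handling:

    import torch

    # Hypothetical sizes: d features per column, n columns we want roughly orthogonal.
    d, n = 20, 10
    W = torch.randn(d, n, requires_grad=True)

    gram = W.t() @ W                                # n x n matrix of pairwise column inner products
    off_diag = gram - torch.diag(torch.diag(gram))  # zero out the diagonal, keep the cross terms
    penalty = (off_diag ** 2).sum()                 # squared, sign-free orthogonality penalty

    penalty.backward()      # autograd handles it: only sums and products are involved
    print(W.grad.shape)     # gradient w.r.t. W, same d x n shape as W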
The bigger problem is computational complexity: if W is d x n, both the forward and the backward pass cost O(n^2 d). So if this is a neural-net layer with 1000 units, the penalty requires on the order of 1,000,000,000 operations (as opposed to 1,000,000 for normal backprop through that layer). In general, pairwise penalties in weight space are best avoided. You can reduce the cost by applying this kind of regularisation stochastically (similarly to dropout): randomly sample K units each step and apply the penalty only to them, as sketched below.
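A rough sketch of that stochastic variant, under the same PyTorch assumption (the function name and the choice of k are mine, purely for illustration):

    import torch

    def sampled_orthogonality_penalty(W, k):
        # Penalize pairwise inner products of a random subset of k columns of W.
        n = W.shape[1]
        idx = torch.randperm(n, device=W.device)[:k]  # fresh random subset each call
        sub = W[:, idx]                               # d x k slice of the parameter matrix
        gram = sub.t() @ sub                          # k x k pairwise inner products
        off_diag = gram - torch.diag(torch.diag(gram))
        return (off_diag ** 2).sum()

    # Hypothetical usage inside a training step, with a made-up weight lam:
    # loss = task_loss + lam * sampled_orthogonality_penalty(W, k=64)

Because a fresh subset is drawn every step, all column pairs are still penalized in expectation, at O(k^2 d) cost per step instead of O(n^2 d).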
Upvotes: 1