Reputation: 786
I'm debugging my constrained stochastic gradient descent algorithm, and the paper http://research.microsoft.com/pubs/192769/tricks-2012.pdf suggests checking the gradients using finite differences. I added a penalty function, but the model no longer converges, so I want to check my gradient as suggested in the paper.
I can pick an example and compute the loss for that example, but my weight vector contains ~4000 features, so I get a vector of that many partial derivatives as my gradient, while the loss is a single scalar value, so it's not possible to compute Q(z, w) + δg directly. Do I have to compute the loss for a single feature of w only? Is that what is meant by "the current w"?
Upvotes: 1
Views: 1915
Reputation: 66815
The equation in the publication looks odd because it is not described carefully. To check a gradient you usually verify that your "guessed" (analytic) gradient is close to the numerical gradient, whose i-th dimension equals
( Q(z, w + delta*e_i) - Q(z, w) ) / delta
for a small enough delta, where e_i is the i-th canonical vector (1 in the i-th dimension and 0 everywhere else). In other words, if we denote by g_i the i-th dimension of your gradient, then you need to check that
| ( Q(z, w + delta*e_i) - Q(z, w) ) / delta - g_i | < eps
Multiplying both sides by delta gives
| Q(z, w + delta*e_i) - Q(z, w) - delta * g_i | < delta*eps
which boils down to checking
| Q(z, w + delta*e_i) - ( Q(z, w) + delta * g_i ) | < delta*eps
thus checking whether
Q(z, w + delta*e_i) ≈ Q(z, w) + delta * g_i
which is exactly their equation, simply applied feature-wise (one coordinate of w at a time).
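As a rough illustration, a minimal NumPy sketch of this feature-wise check could look as follows; here loss_fn and grad_fn are hypothetical placeholders for your loss Q(z, ·) on a fixed example z and your analytic gradient, and the tolerances delta and eps are just example values you would tune:

    import numpy as np

    def check_gradient(loss_fn, grad_fn, w, delta=1e-6, eps=1e-4):
        """Compare the analytic gradient grad_fn(w) against finite
        differences, one coordinate of w at a time.

        loss_fn(w) -> scalar loss Q(z, w) for a fixed example z
        grad_fn(w) -> vector of partial derivatives dQ/dw_i
        """
        g = grad_fn(w)            # "guessed" analytic gradient
        base = loss_fn(w)         # Q(z, w)
        for i in range(len(w)):
            e_i = np.zeros(len(w))
            e_i[i] = 1.0          # i-th canonical vector
            numeric = (loss_fn(w + delta * e_i) - base) / delta
            if abs(numeric - g[i]) > eps:
                print(f"dim {i}: analytic {g[i]:.6f} vs numeric {numeric:.6f}")
                return False
        return True

If the check fails only after you add the penalty term, the gradient of the penalty is the first place to look.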
Upvotes: 0