Reputation: 10631
I am trying to understand a simple implementation of Softmax classifier from this link - CS231n - Convolutional Neural Networks for Visual Recognition. Here they implemented a simple softmax classifier. In the example of Softmax Classifier on the link, there are random 300 points on a 2D space and a label associated with them. The softmax classifier will learn which point belong to which class.
Here is the full code of the softmax classifier. Or you can see the link I have provided.
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in xrange(200):
# evaluate class scores, [N x K]
scores = np.dot(X, W) + b
# compute the class probabilities
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
# compute the loss: average cross-entropy loss and regularization
corect_logprobs = -np.log(probs[range(num_examples),y])
data_loss = np.sum(corect_logprobs)/num_examples
reg_loss = 0.5*reg*np.sum(W*W)
loss = data_loss + reg_loss
if i % 10 == 0:
print "iteration %d: loss %f" % (i, loss)
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
# backpropate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
# perform a parameter update
W += -step_size * dW
b += -step_size * db
I cant understand how they computed the gradient here. I assume that they computed the gradient here -
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
But How? I mean Why gradient of dW
is np.dot(X.T, dscores)
? And Why the gradient of db
is np.sum(dscores, axis=0, keepdims=True)
?? So how they computed the gradient on weight and bias? Also why they computed the regularization gradient
?
I am just starting to learn about convolutional neural networks and deep learning. And I heard that CS231n - Convolutional Neural Networks for Visual Recognition
is a good starting place for that. I did not know where to place deep learning related post. So, i placed them on stackoverflow. If there is any place to post questions related to deep learning please let me know.
Upvotes: 5
Views: 3306
Reputation: 22368
Backpropagation is to reduce the cost J
of the entire system (softmax classifier here) and it is a problem to optimize the weight parameter W
to minimize the cost. Providing the cost function J = f(W)
is convex, the gradient descent W = W - α * f'(W)
will result in the Wmin
which minimizes J
. The hyperparameter α
is called learning rate which we need to optimize too, but not in this answer.
Y
should be read as J
in the diagram. Imagine you are on the surface of a place whose shape is defined as J = f(W)
and you need to reach the point Wmin
. There is no gravity so you do not know which way is toward the bottom but you know the function and your coordinate. How do you know which way you should go? You can find the direction from the derivative f'(W)
and move to a new coordinate by W = W - α * f'(W)
. By repeating this, you can get closer and closer to the point Wmin
.
At the node where multiply or dot operation happens (affin), the function is J = f(W) = X * W
. Suppose there are m
number of fixed two dimensional coordinates represented as X. How can we find the hyper-plane which minimizes J = f(W) = X * W
and its vector W
?
We can get closer to the optimal W
by repeating the gradient descent W += -α * X
if α is appropriate.
When there are layers after the Affine layer such as the softmax layer and the log loss layer in the softmax classifier, we can calculate the gradient with the chain rule. In the diagram, replace sigmoid with softmax.
As stated in Computing the Analytic Gradient with Backpropagation in the cs321 page, the gradient contribution from the softmax layer and the log loss layer is the dscore part. See the Note section below too.
By applying the gradient to that of the affine layer via the chain rule, the code is derived where α is replaced with step_size. In reality, the step_size needs to be learned as well.
dW = np.dot(X.T, dscores)
W += -step_size * dW
The bias gradient can be derived by applying the chain rule towards the bias b with the gradients (dscore) from the post layers.
db = np.sum(dscores, axis=0, keepdims=True)
As stated in Regularization of the cs231 page, the cost function (objective) is adjusted by adding the regularization, which is reg_loss in the code. It is to reduce the over-fitting. The intuition is, in my understanding, if specific feature(s) cause overfitting, we can reduce it by inflating the cost with their weight parameters W, because the gradient descent will work to reduce the cost contributions from the weights. Since we do not know which ones, use all W. The reason of 0.5 * W*W is because it gives simple derivative W.
reg_loss = 0.5*reg*np.sum(W*W)
The gradient contribution reg*W
is from the derivative of reg_loss. The reg is a hyper parameter to be learned in the real training.
reg_loss/dw -> 0.5 * reg * 2 * W
It is added to the gradient from the layers after the affin.
dW += reg*W # regularization gradient
The process to get the derivative from the cost including the regularization is omitted in the cs231 page referenced in the post, probably because it is a common practice to just put the gradient of the regularization, but confusing for those who are learning. See Coursera Machine Learning Week 3 Cost Function by Andrew Ng for the regularization.
The bias parameter b is substituted with X0 as the bias can be omitted by shifting to the base.
Upvotes: 0
Reputation: 43517
The gradients start being computed here:
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
First, this sets dscores
equal to the probabilities computed by the softmax function. Then, it subtracts 1
from the probabilities computed for the correct classes in the second line, and then it divides by the number of training samples in the third line.
Why does it subtract 1
? Because you want the probabilities of the correct labels to be 1
, ideally. So it subtracts what it should predict from what it actually predicts: if it predicts something close to 1
, the subtraction will be a large negative number (close to zero), so the gradient will be small, because you're close to a solution. Otherwise, it will be a small negative number (far from zero), so the gradient will be bigger, and you'll take larger steps towards the solution.
Your activation function is simply w*x + b
. Its derivative with respect to w
is x
, which is why dW
is the dot product between x
and the gradient of the scores / output layer.
The derivative of w*x + b
with respect to b
is 1
, which is why you simply sum dscores
when backpropagating.
Upvotes: 9