yutseho

Reputation: 1689

How does backpropagation work in Torch 7?

I tried to understand supervised learning from the Torch tutorial:

http://code.madbits.com/wiki/doku.php?id=tutorial_supervised

And backpropagation from:

http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html

As far as I know, the parameter update in this Torch tutorial happens in Step 4, Training Procedure:

output = model:forward(inputs[i])               -- forward pass
df_do = criterion:backward(output, targets[i])  -- gradient of the loss w.r.t. the output
model:backward(inputs[i], df_do)                -- backpropagate that gradient through the model

For example, I got this:

output = -2.2799
         -2.3638
         -2.3183
         -2.1955
         -2.3377
         -2.3434
         -2.3740
         -2.2641
         -2.3449
         -2.2214
         [torch.DoubleTensor of size 10]

targets[i] = 9

Is df_do this?

0
0
0
0
0
0
0
0
-1
0
[torch.DoubleTensor of size 10]

I know the target is 9 and the predicted class is 4 (the largest output value) in this example, so the prediction is wrong, and the 9th element of df_do is -1.

But why?

According to http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html,

df_do should be [ target (desired output) - output ].

Upvotes: 1

Views: 2038

Answers (1)

Alexander Lutsenko

Reputation: 2160

In Torch, backprop works exactly as it does in mathematics. df_do is the derivative of the loss w.r.t. the prediction, and is therefore entirely defined by your loss function, i.e. the nn.Criterion. The most famous one is Mean Square Error (nn.MSECriterion):

loss(output, target) = (1/n) * sum_i (output_i - target_i)^2

Note that the MSE criterion expects the target to have the same size as the prediction (a one-hot vector for classification). If you choose MSE, your derivative vector df_do will be computed as:

df_do_i = (2/n) * (output_i - target_i)
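
For comparison, here is a minimal sketch of nn.MSECriterion, assuming a hypothetical 10-class setup where targets[i] = 9 has been one-hot encoded (the random output just stands in for model:forward(inputs[i])):

require 'nn'

output = torch.randn(10)                    -- stand-in for model:forward(inputs[i])
target = torch.zeros(10)
target[9] = 1                               -- one-hot encoding of targets[i] = 9

criterion = nn.MSECriterion()
loss  = criterion:forward(output, target)   -- (1/n) * sum_i (output_i - target_i)^2
df_do = criterion:backward(output, target)  -- (2/n) * (output_i - target_i), one entry per class

print(df_do)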

The MSE criterion, however, is typically not very good for classification. A more suitable one is a likelihood criterion, which takes a probability vector as the prediction and the scalar index of the true class as the target. The aim is simply to maximize the probability of the true class, which is equivalent to minimizing its negative:

loss(p, target) = -p[target]

If we feed it a log-probability vector as the prediction (the logarithm is a monotone transformation, so it doesn't change the optimization result, but it is more numerically stable), we get the Negative Log Likelihood loss function (nn.ClassNLLCriterion):

loss(output, target) = -output[target],  where output = log(p)

In that case, df_do is as follows:

df_do[i] = -1 if i == target, and 0 otherwise
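
To check this against the numbers in the question, here is a minimal sketch using nn.ClassNLLCriterion with the log-probability vector and target from the question:

require 'nn'

output = torch.Tensor({-2.2799, -2.3638, -2.3183, -2.1955, -2.3377,
                       -2.3434, -2.3740, -2.2641, -2.3449, -2.2214})
target = 9

criterion = nn.ClassNLLCriterion()
loss  = criterion:forward(output, target)   -- -output[9] = 2.3449
df_do = criterion:backward(output, target)  -- all zeros except df_do[9] = -1

print(df_do)                                -- matches the df_do vector in the question

This df_do is then exactly what gets passed to model:backward(inputs[i], df_do) in the tutorial's training step.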

In the Torch tutorial, the NLL criterion is used by default, which is why your df_do is -1 at the target index and 0 everywhere else.

Upvotes: 4
