Wasi Ahmad

Reputation: 37761

Custom loss function in PyTorch

I have three simple questions.

  1. What will happen if my custom loss function is not differentiable? Will PyTorch throw an error or do something else?
  2. If I declare a loss variable in my custom function which will represent the final loss of the model, should I set requires_grad = True for that variable? Or does it not matter? If it doesn't matter, then why?
  3. I have seen people sometimes write a separate layer and compute the loss in the forward function. Which approach is preferable, writing a function or a layer? Why?

I would appreciate a clear explanation of these questions to resolve my confusion. Please help.

Upvotes: 17

Views: 13838

Answers (1)

mbpaulus

Reputation: 7711

Let me have a go.

  1. This depends on what you mean by "non-differentiable". The first definition that makes sense here is that PyTorch doesn't know how to compute gradients. If you try to compute gradients nevertheless, this will raise an error. The two possible scenarios are:

    a) You're using a PyTorch operation for which gradients have not been implemented, e.g. torch.svd() (at the time of writing). In that case you will get a TypeError:

    import torch
    from torch.autograd import Function
    from torch.autograd import Variable
    
    A = Variable(torch.randn(10,10), requires_grad=True)
    u, s, v = torch.svd(A) # raises TypeError
    

    b) You have implemented your own operation, but did not define backward(). In this case, you will get a NotImplementedError:

    class my_function(Function): # forgot to define backward()
    
        def forward(self, x):
            return 2 * x
    
    A = Variable(torch.randn(10,10))
    B = my_function()(A)
    C = torch.sum(B)
    C.backward() # will raise NotImplementedError
    

    The second definition that makes sense is "mathematically non-differentiable". Clearly, an operation which is mathematically not differentiable should either not have a backward() method implemented or have a sensible sub-gradient. Consider for example torch.abs(), whose backward() method returns the subgradient 0 at 0:

    A = Variable(torch.Tensor([-1,0,1]),requires_grad=True)
    B = torch.abs(A)
    B.backward(torch.Tensor([1,1,1]))
    A.grad.data  # tensor([-1., 0., 1.]): the subgradient of abs() at 0 is taken to be 0
    

    For these cases, you should refer to the PyTorch documentation and dig out the backward() method of the respective operation directly.
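To make case (b) above concrete, here is a sketch of how the forgotten backward() could be supplied. Note that recent PyTorch versions require forward() and backward() to be static methods that receive a context object, and the function is invoked via .apply() rather than by instantiating it; the class name MyDouble is hypothetical.

```python
import torch
from torch.autograd import Function

class MyDouble(Function):
    # custom autograd Function with both passes defined
    @staticmethod
    def forward(ctx, x):
        return 2 * x

    @staticmethod
    def backward(ctx, grad_output):
        # d(2x)/dx = 2, so the chain rule scales the incoming gradient by 2
        return 2 * grad_output

A = torch.randn(10, 10, requires_grad=True)
B = MyDouble.apply(A)
C = torch.sum(B)
C.backward()  # succeeds now that backward() is defined; A.grad is all 2s
```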

  2. It doesn't matter. The purpose of requires_grad is to avoid unnecessary gradient computation for subgraphs. If a single input to an operation requires gradient, its output will also require gradient. Conversely, only if all inputs don't require gradient will the output not require it either. Backward computation is never performed in subgraphs where no Variable requires gradients.

    Since there are most likely some Variables that require gradients (for example, parameters of a subclass of nn.Module()), your loss Variable will automatically require gradients too. However, note that because of how requires_grad works (see above), you can only change requires_grad on leaf variables of your graph anyway.
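A minimal sketch of this propagation rule, using plain tensors (which replace Variable in current PyTorch versions):

```python
import torch

x = torch.randn(3, requires_grad=True)  # leaf that requires gradients
w = torch.randn(3)                      # leaf that does not

y = x * w  # one input requires grad -> the output requires grad too
z = w * 2  # no input requires grad -> the output doesn't either

print(y.requires_grad)  # True
print(z.requires_grad)  # False

# Also note: flipping requires_grad on a non-leaf like y would raise a
# RuntimeError; only leaf variables (x, w) can be changed.
```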

  3. All the PyTorch loss functions are subclasses of _Loss, which is itself a subclass of nn.Module. See here. If you'd like to stick to this convention, you should subclass _Loss when defining your custom loss function. Apart from consistency, one advantage is that your subclass will raise an AssertionError if you haven't marked your target variables as volatile or requires_grad = False. Another advantage is that you can nest your loss function in nn.Sequential(), because it is an nn.Module. I would recommend this approach for these reasons.
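A sketch of this layer-style approach, subclassing nn.Module directly (since _Loss is an internal class); MyMSELoss is a hypothetical name and the loss itself is just a mean squared error:

```python
import torch
import torch.nn as nn

class MyMSELoss(nn.Module):
    # custom loss written as a module, following the convention above
    def forward(self, input, target):
        return torch.mean((input - target) ** 2)

model = nn.Linear(4, 1)
criterion = MyMSELoss()

x = torch.randn(8, 4)
target = torch.randn(8, 1)
loss = criterion(model(x), target)
loss.backward()  # gradients flow into the model's parameters automatically
```

Because the criterion is an nn.Module, it composes with the rest of the nn machinery just like any other layer.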

Upvotes: 21
