Abhishek Bhatia

Reputation: 9806

requires_grad relation to leaf nodes

From the docs:

requires_grad – Boolean indicating whether the Variable has been created by a subgraph containing any Variable, that requires it. Can be changed only on leaf Variables

  1. What does it mean by leaf nodes here? Are leaf nodes only the input nodes?
  2. If it can be only changed at the leaf nodes, how can I freeze layers then?

Upvotes: 7

Views: 13708

Answers (1)

mbpaulus

Reputation: 7691

  1. Leaf nodes of a graph are those nodes (i.e. Variables) that were not computed from other nodes in the graph, i.e. the ones you create yourself rather than derive from existing Variables. For example:

    import torch
    from torch.autograd import Variable
    
    A = Variable(torch.randn(10,10)) # this is a leaf node
    B = 2 * A # this is not a leaf node
    w = Variable(torch.randn(10,10)) # this is a leaf node
    C = A.mm(w) # this is not a leaf node
    

    If a leaf node has requires_grad set, all subsequent nodes computed from it will automatically require a gradient as well; otherwise the chain rule could not be applied to compute the gradient of that leaf node. This is why requires_grad can only be set on leaf nodes: for all other nodes it is inferred, since it is fully determined by the settings of the leaf nodes they were computed from (see the first sketch at the end of this answer).

  2. Note that in a typical neural network, all parameters are leaf nodes. They are not computed from any other Variables in the network. Hence, freezing layers using requires_grad is simple. Here is an example taken from the PyTorch docs:

    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    
    model = torchvision.models.resnet18(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    
    # Replace the last fully-connected layer
    # Parameters of newly constructed modules have requires_grad=True by default
    model.fc = nn.Linear(512, 100)
    
    # Optimize only the classifier
    optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
    

    Note, though, that what this really does is freeze the entire gradient computation for those layers (which is what you should be doing, as it avoids unnecessary computation). Technically, you could also leave the requires_grad flag on and only pass the subset of parameters you want to learn to your optimizer (see the second sketch below).
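The propagation described in point 1 is easy to verify. Below is a minimal sketch, extending the example above with the standard `.is_leaf` and `.requires_grad` attributes; it checks the flags on leaf and computed nodes and shows that flipping the flag on a non-leaf fails:

    import torch
    from torch.autograd import Variable
    
    A = Variable(torch.randn(10, 10), requires_grad=True)  # leaf, gradient requested
    w = Variable(torch.randn(10, 10))                       # leaf, no gradient requested
    B = 2 * A                                               # computed node
    C = A.mm(w)                                             # computed node
    
    print(A.is_leaf, w.is_leaf)              # True True   -- created by the user
    print(B.is_leaf, C.is_leaf)              # False False -- computed from other nodes
    print(B.requires_grad, C.requires_grad)  # True True   -- inferred from A
    
    # Setting the flag on a non-leaf raises an error, which is what the quoted
    # doc sentence means by "can be changed only on leaf Variables".
    try:
        B.requires_grad = False
    except RuntimeError as err:
        print(err)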
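And here is a rough sketch of the alternative mentioned in point 2: requires_grad stays True everywhere, so backward() still computes gradients for the whole network, but only the classifier is ever updated because only its parameters are handed to the optimizer. The dummy batch shapes are illustrative only:

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    
    model = torchvision.models.resnet18(pretrained=True)
    model.fc = nn.Linear(512, 100)
    
    # Only the classifier's parameters are registered with the optimizer
    optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    
    inputs = torch.randn(4, 3, 224, 224)   # dummy batch, illustrative only
    labels = torch.randint(0, 100, (4,))
    
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()   # gradients are computed for *all* parameters here
    optimizer.step()  # ...but only model.fc is actually updated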

Upvotes: 19
