How can I calculate the network gradients w.r.t weights for all inputs in PyTorch?

Question

I'm trying to figure out how to calculate the gradient of the network for each input, and I'm a bit lost. Essentially, I want to calculate d self.output/d weight1 and d self.output/d weight2 for all values of input x. So I would have a matrix of size (1000, 5), for example, where 1000 is the size of the input x and 5 is the number of weights in the layer.

The example I've included below returns weights as size (1,5). What exactly is being calculated here? Is this d self.output/ d weight1 for 1 input of x, or an average of all inputs?

Secondly, would a matmul of features.grad and weight1.grad be the same as what I'm asking for, i.e. a matrix of all the gradients of weight1 for all values of x?

import sys

import torch
import torch.nn as nn


class Network(torch.nn.Module):

    def __init__(self, iNode, hNode, oNode):
        super(Network, self).__init__()

        print("Building Model...")

        iNode = int(iNode) ; self.iNode = iNode
        hNode = int(hNode) ; self.hNode = hNode
        oNode = int(oNode) ; self.oNode = oNode

        self.fc1 = nn.Linear(iNode, hNode, bias=False)
        self.fc2 = nn.Linear(hNode, oNode, bias=False)

    def forward(self, x):
        self.hidden_probs = self.fc1(x)
        self.hidden = self.actFunc1(self.hidden_probs)
        self.output_probs = self.fc2(self.hidden)
        self.output = self.actFunc2(self.output_probs)
        return self.output

    def actFunc1(self, x):
        return 1.0/(1.0+torch.exp(-x))

    def actFunc2(self, x):
        return x

    def trainData(self, features, labels, epochs, alpha, optimisation, verbose=False):

        for epoch in range(0,epochs):
            net_pred = self.forward(features)
            net_pred.backward(gradient=torch.ones(features.size())) #calc. dout/dw for all w
            print(features.grad.size()) #returns (1000,1)

            with torch.no_grad():
                for name, param in self.named_parameters():
                    if(param.requires_grad):
                        param -= alpha*param.grad

                for name, param in self.named_parameters():
                    if(param.requires_grad):
                        param.grad.zero_()


            sys.stdout.write("Epoch: %06i\n" % (epoch))
            sys.stdout.flush()
        sys.stdout.write("\n")

Lennart · Accepted Answer

I am not sure exactly what you are trying to achieve, because normally you only work with the sum of the gradients (d output)/(d parameter) and not with any of the gradients in between, since autograd takes care of that. But let me try to answer.

Question 1

The example I've included below returns weights as size (1,5). What exactly is being calculated here? Is this d self.output/ d weight1 for 1 input of x, or an average of all inputs?

You get size (1,5) because training is done in mini batches: the gradient for each data point with respect to the (5) weights is calculated, and these gradients are summed (not averaged) over the mini batch. According to the docs:

This attribute is None by default and becomes a Tensor the first time a call to backward() computes gradients for self. The attribute will then contain the gradients computed and future calls to backward() will accumulate (add) gradients into it.
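To make the accumulation behaviour concrete, here is a minimal sketch. The parameter `w` and the factor 3.0 are made up for illustration and are not from your network:

```python
import torch

# Hypothetical single parameter, used only to illustrate accumulation.
w = torch.nn.Parameter(torch.tensor([2.0]))
print(w.grad)  # None: no backward() call has happened yet

(3.0 * w).sum().backward()
first = w.grad.clone()   # d(3w)/dw = 3

(3.0 * w).sum().backward()
second = w.grad.clone()  # the second call adds another 3, giving 6

w.grad.zero_()           # reset before the next step, as your loop already does
print(first, second, w.grad)
```

This is the same mechanism that sums the contributions of every row in a batch into one .grad tensor.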

If you explicitly want the gradient for each data point, then make your mini batch size one. Normally we train in mini batches because updating after each data point can be unstable: imagine jumping in a different direction each time, whereas with a batch this averages out. At the other extreme, many data sets are simply too large to calculate the gradient in one go.
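As a rough sketch of the "batch size one" idea, the loop below builds the (1000, 5) matrix you describe for the output layer. The `nn.Sequential` model, the `features` tensor and the variable names are stand-ins I made up to mirror your 1-5-1 network, and I use float64 only so the sanity check at the end is numerically tight:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the asker's network: 1 input, 5 hidden units, 1 output.
model = nn.Sequential(
    nn.Linear(1, 5, bias=False),
    nn.Sigmoid(),
    nn.Linear(5, 1, bias=False),
).double()
features = torch.linspace(-1.0, 1.0, 1000, dtype=torch.float64).unsqueeze(1)  # (1000, 1)
out_weight = model[2].weight  # shape (1, 5)

per_sample = []
for x in features:
    out = model(x.unsqueeze(0))  # a batch of size one, shape (1, 1)
    # d out / d out_weight for this single input only
    (grad,) = torch.autograd.grad(out.sum(), out_weight)
    per_sample.append(grad.flatten())

per_sample_grads = torch.stack(per_sample)  # one row per input, shape (1000, 5)

# Sanity check: summing the rows reproduces the gradient of the whole batch.
batch_out = model(features)
(batch_grad,) = torch.autograd.grad(batch_out.sum(), out_weight)
print(per_sample_grads.shape)
print(torch.allclose(per_sample_grads.sum(dim=0), batch_grad.flatten()))
```

A Python loop like this is slow for large inputs, but it makes explicit that the (1,5) tensor you saw is the sum of these rows. `torch.autograd.grad` is used here instead of `backward()` so that nothing accumulates in .grad between iterations.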

Question 2

An example might give more insight:

    import torch
    x = torch.tensor([1.5], requires_grad=True)
    a = torch.nn.Parameter(torch.tensor([2.]))
    b = torch.nn.Parameter(torch.tensor([10.]))
    y = x*a
    z = y+0.5*b
    z.backward()  # backward() returns None, so there is nothing to assign
    print('gradients of a: %0.2f and b: %0.2f' % (a.grad.item(), b.grad.item()))
    # prints: gradients of a: 1.50 and b: 0.50

I start with two parameters, a and b, and calculate z=a*x+0.5*b. No gradients are calculated yet; PyTorch only keeps track of the history of operations, so all .grad attributes are still None. When z.backward() is called, the gradients of the output with respect to the parameters are calculated (here dz/da = x = 1.5 and dz/db = 0.5), which you can view by reading .grad on the parameters.

Updating the parameters can then be done the way you are already doing it: a -= alpha*a.grad, inside a torch.no_grad() block.
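For completeness, a small sketch of one such update step on the toy example. The learning rate `alpha = 0.1` is an arbitrary value I picked for illustration:

```python
import torch

alpha = 0.1  # hypothetical learning rate, chosen only for this example
x = torch.tensor([1.5])
a = torch.nn.Parameter(torch.tensor([2.0]))
b = torch.nn.Parameter(torch.tensor([10.0]))

z = x * a + 0.5 * b
z.backward()  # a.grad = 1.5, b.grad = 0.5

# Update in place without recording the update itself in the autograd graph.
with torch.no_grad():
    a -= alpha * a.grad  # 2.0 - 0.1 * 1.5
    b -= alpha * b.grad  # 10.0 - 0.1 * 0.5

# Clear the gradients so the next backward() starts from zero.
a.grad.zero_()
b.grad.zero_()
print(a.item(), b.item())
```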
