Silas

Reputation: 13

Derivative of Scalar Expansion in PyTorch

I'm working on implementing a very simple autodiff library in Rust to deepen my understanding of how it's done. I have almost everything working, but while implementing negative log-likelihood I realized I'm confused about how to handle the derivative in the following scenario (written in PyTorch below).

x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grads=True)
y = x - torch.sum(x)

I've looked around, experimented, and am still a little confused about what is actually happening here. I know that the derivative with respect to x of the equation above is [-2, -2, -2], but there are a number of ways to get there, and when I expand the equation to the following:

x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grads=True)
y = torch.exp(x - torch.sum(x))

I am completely lost and have no idea how PyTorch derives the gradients for x.

I'm assuming the above equations are being rewritten to something like this:

y = (x - [torch.sum(x), torch.sum(x), torch.sum(x)])

but am not sure, and am really struggling to find info on the topic of scalars being expanded to vectors, or whatever is actually happening. If someone could point me in the right direction, that would be awesome!
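For what it's worth, the closest runnable version of my guess seems to be torch.broadcast_to, which just makes the expansion explicit (I have no idea whether this is what actually happens internally):

import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
s = torch.sum(x)                        # scalar, shape ()
y = x - torch.broadcast_to(s, x.shape)  # explicit version of my guess
y.backward(torch.ones_like(y))
print(x.grad)                           # tensor([-2., -2., -2.])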

If helpful, I can include the gradients PyTorch computes for the above equations.

Upvotes: 0

Views: 303

Answers (2)

kmkurn

Reputation: 685

Your code won't work in PyTorch without modification because it doesn't specify the gradients w.r.t. y. You need them to call y.backward(), which computes the gradients w.r.t. x. From your all -2 result, I figured the gradients must be all ones.
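To see why all ones lead to all -2: with y_i = x_i - sum(x), the Jacobian is J[i][j] = delta(i, j) - 1, and backward() computes the vector-Jacobian product v^T J. A quick check with the explicit 3x3 Jacobian (written out only for illustration):

import torch

# Jacobian of y = x - sum(x) for a length-3 input: J[i][j] = (i == j) - 1
J = torch.eye(3) - torch.ones(3, 3)
v = torch.ones(3)  # the gradients w.r.t. y
print(v @ J)       # tensor([-2., -2., -2.]), the gradients w.r.t. x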

The "scalar expansion" is called broadcasting. As you already know, broadcasting is performed whenever two tensor operands have mismatched shapes. My guess is that it is implemented in the same way as any other operation in PyTorch that knows how to compute the gradients w.r.t its inputs given the gradients w.r.t its outputs. A simple example is given below which (a) works with your given test case and (b) allows us to still use PyTorch's autograd to compute the gradients automatically (see also PyTorch's docs on extending autograd):

import torch
from typing import Tuple

class Broadcast(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, length: int) -> torch.Tensor:
        assert x.ndim == 0, "input must be a scalar tensor"
        assert length > 0, "length must be greater than zero"
        return x.new_full((length,), x.item())  # expand the scalar to a vector

    @staticmethod
    def backward(ctx, grad: torch.Tensor) -> Tuple[torch.Tensor, None]:
        # The scalar fed every output element, so its gradient is the sum
        # of the incoming gradients; None is for the length argument.
        return grad.sum(), None

Now, by setting broadcast = Broadcast.apply we can call the broadcasting ourselves instead of letting PyTorch perform it automatically.

broadcast = Broadcast.apply

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x - broadcast(torch.sum(x), x.size(0))
y.backward(torch.ones_like(y))  # all-ones gradients w.r.t. y
assert torch.allclose(torch.tensor(-2.), x.grad)
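As a sanity check (not proof of how PyTorch implements it), the result matches what PyTorch computes when it broadcasts the scalar automatically:

x2 = torch.tensor([1., 2., 3.], requires_grad=True)
y2 = x2 - torch.sum(x2)  # PyTorch broadcasts torch.sum(x2) itself
y2.backward(torch.ones_like(y2))
assert torch.allclose(x.grad, x2.grad)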

Note that I don't know how PyTorch actually implements it. The implementation above is just to illustrate how the broadcasting operation might be written for automatic differentiation to work, which hopefully answers your question.

Upvotes: 0

kung_fu_coder

Reputation: 166

First, a couple of things: the argument is requires_grad, not requires_grads. Second, you can only require gradients for a floating point or complex dtype.

Now, scalar addition/multiplication (note that subtraction/division can be viewed as addition of a negative number/multiplication by a reciprocal) simply adds/multiplies the scalar with every element of the tensor. Hence,

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x - 1

evaluates to:

y = tensor([-1.,  0.,  1.], grad_fn=<SubBackward0>)

Thus, in your case, torch.sum(x) is basically a scalar quantity that is subtracted from all the elements of the tensor x.
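For instance, with your tensor:

x = torch.tensor([1., 2., 3.], requires_grad=True)
print(torch.sum(x))      # tensor(6., grad_fn=<SumBackward0>)
print(x - torch.sum(x))  # tensor([-5., -4., -3.], grad_fn=<SubBackward0>)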

If you are more interested in the gradient part, check the PyTorch documentation on autograd [ref]. It states the following:

The graph is differentiated using the chain rule. If any of tensors are non-scalar (i.e. their data has more than one element) and require gradient, then the Jacobian-vector product would be computed, in this case the function additionally requires specifying grad_tensors. It should be a sequence of matching length, that contains the “vector” in the Jacobian-vector product, usually the gradient of the differentiated function w.r.t. corresponding tensors (None is an acceptable value for all tensors that don’t need gradient tensors).
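Applied to your second example, the all-ones vector plays the role of the "vector" in the Jacobian-vector product. A small sketch using torch.autograd.functional.jacobian to make the Jacobian explicit:

import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.exp(x - torch.sum(x))
y.backward(torch.ones_like(y))  # all-ones "vector" for the JVP

# Reproduce x.grad with an explicit Jacobian
J = torch.autograd.functional.jacobian(lambda t: torch.exp(t - torch.sum(t)), x)
print(torch.allclose(x.grad, torch.ones(3) @ J))  # True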

Upvotes: 0
