Reputation: 113
I have posted this question on Data Science StackExchange site since StackOverflow does not support LaTeX. Linking it here because this site is probably more appropriate.
The question with correctly rendered LaTeX is here: https://datascience.stackexchange.com/questions/48062/pytorch-does-not-seem-to-be-optimizing-correctly
The idea is that I am considering sums of sine waves with different phases. The waves are sampled with some sample rate s
in the interval [0, 2pi]. I need to select phases in such a way, that the sum of the waves at any sample point is minimized.
Below is the Python code. Optimization does not seem to be computed correctly.
import numpy as np
import torch
def phaseOptimize(n, s = 48000, nsteps = 1000):
learning_rate = 1e-3
theta = torch.zeros([n, 1], requires_grad=True)
l = torch.linspace(0, 2 * np.pi, s)
t = torch.stack([l] * n)
T = t + theta
for jj in range(nsteps):
loss = T.sin().sum(0).pow(2).sum() / s
loss.backward()
theta.data -= learning_rate * theta.grad.data
print('Optimal theta: \n\n', theta.data)
print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
Below is a sample output.
phaseOptimize(5, nsteps=100)
Optimal theta:
tensor([[1.2812e-07],
[1.2812e-07],
[1.2812e-07],
[1.2812e-07],
[1.2812e-07]], requires_grad=True)
Maximum value: 5.0
I am assuming this has something to do with broadcasting in
T = t + theta
and/or the way I am computing the loss function.
One way to verify that optimization is incorrect, is to simply evaluate the loss function at random values for the array $\theta_1, \dots, \theta_n$, say uniformly distributed in $[0, 2\pi]$. The maximum value in this case is almost always much lower than the maximum value reported by phaseOptimize()
. Much easier in fact is to consider the case with $n = 2$, and simply evaluate at $\theta_1 = 0$ and $\theta_2 = \pi$. In that case we get:
phaseOptimize(2, nsteps=100)
Optimal theta:
tensor([[2.8599e-08],
[2.8599e-08]])
Maximum value: 2.0
On the other hand,
theta = torch.FloatTensor([[0], [np.pi]])
l = torch.linspace(0, 2 * np.pi, 48000)
t = torch.stack([l] * 2)
T = t + theta
T.sin().sum(0).abs().max().item()
produces
3.2782554626464844e-07
Upvotes: 1
Views: 173
Reputation: 13103
You're being bitten by both PyTorch and math. Firstly, you need to
theta.grad = None
before each backward
step. Otherwise the gradients accumulate instead of overwriting the previous onesT
at each step. PyTorch is not symbolic, unlike TensorFlow and T = t + theta
means "T equals the sum of current t
and current theta
" and not "T equals the sum of t
and theta
, whatever their values may be at any time in the future".With those fixes you get the following code:
def phaseOptimize(n, s = 48000, nsteps = 1000):
learning_rate = 1e-3
theta = torch.zeros(n, 1, requires_grad=True)
l = torch.linspace(0, 2 * np.pi, s)
t = torch.stack([l] * n)
T = t + theta
for jj in range(nsteps):
T = t + theta
loss = T.sin().sum(0).pow(2).sum() / s
theta.grad = None
loss.backward()
theta.data -= learning_rate * theta.grad.data
T = t + theta
print('Optimal theta: \n\n', theta.data)
print('\n\nMaximum value:', T.sin().sum(0).abs().max().item())
which will still not work as you expect because of math.
One can easily see that the minimum to your loss function is when theta
are also uniformly spaced over [0, 2pi)
. The problem is that you are initializing your parameters as torch.zeros
, which leads to all those values being equal (this is the polar opposite of equispaced!). Since your loss function is symmetrical with respect to permutations of theta
, the computed gradients are equal and the gradient descent algorithm can never "differentiate them". In more mathematical terms, you're unlucky enough to initialize your algorithm exactly on a saddle point, so it cannot continue. If you add any noise, it will converge. For instance with
theta = torch.zeros(n, 1) + 0.001 * torch.randn(n, 1)
theta.requires_grad_(True)
Upvotes: 1
Reputation: 7209
You have to move computing T
inside the loop, or it will always have the same constant value, thus constant loss.
Another thing is to initialize theta
to different values at indices, otherwise because of the symmetric nature of the problem the gradient is the same for every index.
Another thing is that you need to zero gradient, because backward
just accumulates them.
This seems to work:
def phaseOptimize(n, s = 48000, nsteps = 1000):
learning_rate = 1e-1
theta = torch.zeros([n, 1], requires_grad=True)
theta.data[0][0] = 1
l = torch.linspace(0, 2 * np.pi, s)
t = torch.stack([l] * n)
for jj in range(nsteps):
T = t + theta
loss = T.sin().sum(0).pow(2).sum() / s
loss.backward()
theta.data -= learning_rate * theta.grad.data
theta.grad.zero_()
Upvotes: 2