Dan

Reputation: 21

Very weird loss spikes when training an autoencoder

Introduction:

I am trying to make an autoencoder learn 32 features (position, velocity, etc.) over 32 time steps, i.e. a 32x32 'image'. For this I built a simple linear (fully-connected) model with a symmetric encoder and decoder, using the Tanh function after every layer.

During training I also apply my own version of dropout, on the input only (in the future I will switch to nn.Dropout).
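
A rough sketch of the built-in equivalent (one difference: unlike my mask, nn.Dropout also rescales the kept inputs by 1/(1-p) during training):

    input_drop = nn.Dropout(p=noise_power)  # zero each input with probability noise_power
    y_pred = model(input_drop(data))        # instead of y_pred = model(data * noise)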


Problem:

I get large spikes in the loss function, sqrt(MSE) (i.e. RMSE), at irregular intervals (batch_size = 6000).

[Loss graph]

What I have tried (small tests, 1000 epochs max):

  1. clip_grad_norm_(model.parameters(), max_norm=0.5).
  2. Tried the ReLU and ELU activation functions instead of Tanh.
  3. Batch size = N/2 (I wanted to use the full dataset N, but my GPU memory was not enough).
  4. Not adding noise or dropout (I think the noise/dropout helps, but it does not solve the problem).
  5. Removing the square root from the MSE loss (see the note after this list).
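
Regarding point 5: by the chain rule, the square root rescales the MSE gradient by the inverse of the loss,

    d/dw sqrt(MSE) = 1 / (2 * sqrt(MSE)) * d(MSE)/dw

so as the loss gets small the effective gradient grows, which may be one factor behind the spikes.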

Can someone explain to me why this happens and how to fix it?

import random

import numpy as np
import torch
import torch.nn as nn

def rand_bin_array(p_zeros, shape):
    # build a flat array of ones, zero out the first p_zeros fraction,
    # then shuffle so the zeros land in random positions
    size = 1
    for e in shape:
        size *= e
    arr = np.ones(size)
    arr[:int(size * p_zeros)] = 0
    np.random.shuffle(arr)
    arr = arr.reshape(shape)
    return arr
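
# Example (illustration): rand_bin_array(0.25, (2, 4)) produces a 2x4 mask
# with int(8 * 0.25) = 2 zeros shuffled into random positions, e.g.
#   [[1. 0. 1. 1.]
#    [1. 1. 0. 1.]]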

class Autoencoder_Liniar(nn.Module):
  def __init__(self):
    super().__init__()
    # encoder: compress the flattened 32x32 input (1024 values) down to 64
    self.encoder = nn.Sequential(
        nn.Linear(1024, 921),
        nn.Tanh(),
        nn.Linear(921, 736),
        nn.Tanh(),
        nn.Linear(736, 515),
        nn.Tanh(),
        nn.Linear(515, 309),
        nn.Tanh(),
        nn.Linear(309, 128),
        nn.Tanh(),
        nn.Linear(128, 64),
        nn.Tanh(),
    )

    # decoder: mirror of the encoder; the final Tanh bounds outputs to [-1, 1]
    self.decoder = nn.Sequential(
        nn.Linear(64, 128),
        nn.Tanh(),
        nn.Linear(128, 309),
        nn.Tanh(),
        nn.Linear(309, 515),
        nn.Tanh(),
        nn.Linear(515, 736),
        nn.Tanh(),
        nn.Linear(736, 921),
        nn.Tanh(),
        nn.Linear(921, 1024),
        nn.Tanh()
    )
  
  def forward(self, x):
    enc = self.encoder(x)
    dec = self.decoder(enc)
    return dec
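
# Shape sanity check (illustration only): the reconstruction has the same
# shape as the input, e.g.
#   Autoencoder_Liniar()(torch.randn(8, 1024)).shape  ->  torch.Size([8, 1024])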

torch.manual_seed(0)
model = Autoencoder_Liniar().cuda()

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

random.seed(0)
epochs = 10000
batch_size = 6000
test_b_size = 5000

# train_data, test_data and their sizes train_nr, test_nr are assumed to be
# prepared elsewhere (lists of flattened 32x32 samples)
train_losses = []
test_losses = []
for i in range(epochs):
  avg_loss = 0
  random.shuffle(train_data)
  for b in range(train_nr // batch_size):
    start = b * batch_size
    data = torch.FloatTensor(train_data[start : start + batch_size]).cuda()

    # input "dropout": fraction of zeroed inputs decays from 0.8 to 0.1
    noise_power = max(0.8 - i/epochs, 0.1)
    noise = torch.FloatTensor(rand_bin_array(noise_power, data.shape)).cuda()
    y_pred = model(data * noise)
    loss = torch.sqrt(criterion(y_pred, data))  # RMSE

    optimizer.zero_grad()
    loss.backward()
    #torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    optimizer.step()

    avg_loss += loss.item()
    if b % 20 == 0:
      print(f'EPOCH: {i} BATCH: {b} LOSS: {loss.item()}')
  
  train_losses.append(avg_loss / (train_nr // batch_size))

  with torch.no_grad():
    avg_loss = 0
    for b in range(test_nr // test_b_size):
        start = b * test_b_size
        data = np.array(test_data[start : start + test_b_size])
        data = torch.FloatTensor(data).cuda()

        y_pred = model(data)
        loss = torch.sqrt(criterion(y_pred, data))
        avg_loss += loss.item()

  test_losses.append(avg_loss / (test_nr // test_b_size))

Added the code below for plotting the gradient norm over epochs (without noise/dropout).

[Gradient norm graph, clipped at 0.3]

    # total L2 norm of all parameter gradients, computed just before the
    # optimizer step (avg_grad is accumulated for a per-epoch average)
    total_norm = 0
    for p in model.parameters():
        param_norm = p.grad.detach().norm(2)
        total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    avg_grad += total_norm

    optimizer.step()

Upvotes: 0

Views: 522

Answers (1)

Dan

Reputation: 21

The answer was clipping the gradient with clip_grad_norm_, but at a lower value.

The value was chosen after plotting the gradient norm over epochs (graph below).
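
For reference, this is roughly where the clipping call sits in the training loop (0.2 being the value read off the graph):

    optimizer.zero_grad()
    loss.backward()
    # clip the global gradient norm before the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.2)
    optimizer.step()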

[Loss graph with gradient clipping at 0.2]

Upvotes: 0
