Reputation: 21
Introduction:
I am trying to train an autoencoder on 32 features (position, velocity, etc.) over 32 time steps, i.e. a 32x32 ‘image’. For this I built a simple fully connected model with a symmetric encoder and decoder, using the Tanh function after every layer.
During training, I apply my own version of dropout to the input only (in the future I will use nn.Dropout).
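For reference, here is a minimal sketch of how nn.Dropout behaves when applied to an input batch. It is not a drop-in replacement for the custom mask: nn.Dropout zeroes each element independently with probability p (so the fraction of zeros is only approximately p), and it rescales the surviving elements by 1 / (1 - p) during training. The tensor shape and p = 0.5 below are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)       # p is the probability of zeroing an element
x = torch.ones(4, 1024)        # stand-in for a batch of flattened 32x32 inputs

drop.train()                   # dropout is only active in training mode
noisy = drop(x)
# Surviving elements are scaled by 1 / (1 - p), here 2.0; the rest are 0.
kept_values = noisy[noisy != 0].unique()

drop.eval()                    # in eval mode nn.Dropout is the identity
clean = drop(x)
```

Because of the rescaling, the network sees inputs on a different scale than with a plain 0/1 mask, which matters when the reconstruction target is the unmasked data.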
Problem:
I get large spikes in the loss function “sqrt(MSE)” at irregular intervals (Batch_Size = 6000).
What I have tried (small tests, 1000 epochs max):
- clip_grad_norm_(model.parameters(), max_norm = 0.5)
- ReLU and ELU
- Batch = N / 2 (I wanted to use N, but my GPU did not have enough memory)
Can someone explain to me why this happens and how to fix it?
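import random

import numpy as np
import torch
import torch.nn as nn


def rand_bin_array(p_zeros, shape):
    # Binary mask of the given shape with a fraction p_zeros of entries set to 0.
    size = 1
    for e in shape:
        size *= e
    arr = np.ones(size)
    arr[:int(size * p_zeros)] = 0
    np.random.shuffle(arr)
    arr = arr.reshape(shape)
    return arr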
class Autoencoder_Liniar(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(1024, 921),
            nn.Tanh(),
            nn.Linear(921, 736),
            nn.Tanh(),
            nn.Linear(736, 515),
            nn.Tanh(),
            nn.Linear(515, 309),
            nn.Tanh(),
            nn.Linear(309, 128),
            nn.Tanh(),
            nn.Linear(128, 64),
            nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(64, 128),
            nn.Tanh(),
            nn.Linear(128, 309),
            nn.Tanh(),
            nn.Linear(309, 515),
            nn.Tanh(),
            nn.Linear(515, 736),
            nn.Tanh(),
            nn.Linear(736, 921),
            nn.Tanh(),
            nn.Linear(921, 1024),
            nn.Tanh()
        )

    def forward(self, x):
        enc = self.encoder(x)
        dec = self.decoder(enc)
        return dec
torch.manual_seed(0)
model = Autoencoder_Liniar().cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
random.seed(0)
epochs = 10000
batch_size = 6000
test_b_size = 5000
train_losses = []
test_losses = []
for i in range(epochs):
    avg_loss = 0
    random.shuffle(train_data)
    for b in range(train_nr // batch_size):
        start = b * batch_size
        data = torch.FloatTensor(train_data[start : start + batch_size]).cuda()
        noise_power = max(0.8 - i/epochs, 0.1)
        noise = torch.FloatTensor(rand_bin_array(noise_power, data.shape)).cuda()
        y_pred = model(data * noise)
        loss = torch.sqrt(criterion(y_pred, data))
        optimizer.zero_grad()
        loss.backward()
        #torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
        optimizer.step()
        avg_loss += loss.item()
        if b % 20 == 0:
            print(f'EPOCH: {i} BATCH: {b} LOSS: {loss.item()}')
    train_losses.append(avg_loss / (train_nr // batch_size))
    with torch.no_grad():
        avg_loss = 0
        for b in range(test_nr // test_b_size):
            start = b * test_b_size
            data = np.array(test_data[start : start + test_b_size])
            data = torch.FloatTensor(data).cuda()
            y_pred = model(data)
            loss = torch.sqrt(criterion(y_pred, data))
            avg_loss += loss.item()
        test_losses.append(avg_loss / (test_nr // test_b_size))
Added code for plotting the gradient's norm over epochs (without noise/dropout):
# Runs after loss.backward(), so that p.grad is populated.
total_norm = 0
for p in model.parameters():
    param_norm = p.grad.detach().norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
avg_grad += total_norm
optimizer.step()
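As a side note, torch.nn.utils.clip_grad_norm_ returns the total gradient norm measured before clipping, so the manual loop can be cross-checked against it. The sketch below uses a tiny stand-in model (an assumption, not the autoencoder from the post) and passes max_norm=inf so that nothing is actually clipped.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)                     # tiny stand-in model
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()

# Manual computation, as in the snippet above.
manual_norm = sum(p.grad.detach().norm(2).item() ** 2
                  for p in model.parameters()) ** 0.5

# max_norm=inf means nothing is clipped; the call only reports the norm.
reported_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=float("inf")).item()
```

Both values should agree up to floating-point error.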
Upvotes: 0
Views: 522
Reputation: 21
The fix was clipping the gradient with clip_grad_norm_, but at a lower max_norm value.
I chose the value after plotting the gradient's norm over epochs.
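A minimal sketch of that workflow, under stated assumptions: the small model and synthetic data are stand-ins, and picking the 90th percentile of the recorded norms is a hypothetical heuristic, not the value or rule actually used here. The idea is to record the unclipped gradient norms first, then choose max_norm from that history.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 8), nn.Tanh(), nn.Linear(8, 16), nn.Tanh())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
data = torch.rand(256, 16) * 2 - 1           # synthetic data in [-1, 1]

def train_step(max_norm):
    """Run one optimizer step; return the gradient norm measured before clipping."""
    loss = torch.sqrt(criterion(model(data), data))
    optimizer.zero_grad()
    loss.backward()
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return norm.item()

# Phase 1: no clipping, just record the norms (the data behind the graph).
history = [train_step(float("inf")) for _ in range(50)]

# Phase 2: pick a threshold from the history. The 90th percentile is a
# hypothetical heuristic, not the value used in the post.
max_norm = float(torch.tensor(history).quantile(0.9))

# Phase 3: keep training with clipping enabled.
clipped_history = [train_step(max_norm) for _ in range(50)]
```

Plotting history (for example with matplotlib) shows where the typical norm sits and how large the spikes are, which makes it easier to choose a threshold that trims the spikes without throttling ordinary updates.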
Upvotes: 0