Roy

Reputation: 147

Use Adam optimizer for LSTM network vs LBFGS

I have modified the PyTorch tutorial on LSTMs (sine-wave prediction: given the sine values [0:N], predict the values [N:2N]) to use the Adam optimizer instead of the LBFGS optimizer. However, the model does not train well and cannot predict the sine wave correctly. Since Adam is the usual choice for training RNNs, I wonder how this issue can be resolved. I also wonder whether the sequence-in-sequence-out code segment (done with a loop: for input_t in input.split(1, dim=1)) can be done by a PyTorch module or function.

from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt

class Sequence(nn.Module):
    def __init__(self):
        super(Sequence, self).__init__()
        self.lstm1 = nn.LSTMCell(1, 51)
        self.lstm2 = nn.LSTMCell(51, 51)
        self.linear = nn.Linear(51, 1)

    def forward(self, input, future = 0):
        outputs = []
        h_t = torch.zeros(input.size(0), 51, dtype=torch.double)
        c_t = torch.zeros(input.size(0), 51, dtype=torch.double)
        h_t2 = torch.zeros(input.size(0), 51, dtype=torch.double)
        c_t2 = torch.zeros(input.size(0), 51, dtype=torch.double)

        for input_t in input.split(1, dim=1):  # process the given sequence one time step at a time
            h_t, c_t = self.lstm1(input_t, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(h_t2)
            outputs += [output]
        for i in range(future):  # predict future values by feeding the last output back as the next input
            h_t, c_t = self.lstm1(output, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(h_t2)
            outputs += [output]
        outputs = torch.cat(outputs, dim=1)
        return outputs


if __name__ == '__main__':
    # set random seed to 0
    np.random.seed(0)
    torch.manual_seed(0)
    # load data and make training set
    data = torch.load('traindata.pt')
    input = torch.from_numpy(data[3:, :-1])
    target = torch.from_numpy(data[3:, 1:])
    test_input = torch.from_numpy(data[:3, :-1])
    test_target = torch.from_numpy(data[:3, 1:])
    print("input.size", input.size())
    print("target.size", target.size())
    # build the model
    seq = Sequence()
    seq.double()
    criterion = nn.MSELoss()
    # use Adam as the optimizer (the original tutorial used LBFGS on the whole data set)
    optimizer = optim.Adam(seq.parameters(), lr=0.005)
    #begin to train
    for i in range(15):
        print('STEP: ', i)
        seq.train()
        def run1step():
            optimizer.zero_grad()
            out = seq(input)
            loss = criterion(out, target)
            print('train loss:', loss.item())
            loss.backward()
            return loss
        run1step()
        optimizer.step()
        # begin to predict, no need to track gradient here
        seq.eval()
        with torch.no_grad():
            future = 1000
            pred = seq(test_input, future=future)
            loss = criterion(pred[:, :-future], test_target)
            print('test  loss:', loss.item())
            y = pred.detach().numpy()
        # draw the result
        def draw(yi, color):
            plt.figure(figsize=(30,10))
            plt.title('Predict future values for time sequences\n(Dashlines are predicted values)', fontsize=30)
            plt.xlabel('x', fontsize=20)
            plt.ylabel('y', fontsize=20)
            plt.xticks(fontsize=20)
            plt.yticks(fontsize=20)
            plt.plot(np.arange(input.size(1)), yi[:input.size(1)], color, linewidth = 2.0)
            plt.plot(np.arange(input.size(1), input.size(1) + future), yi[input.size(1):], color + ':', linewidth = 2.0)
            plt.show()
        if i == 14:
          draw(y[0], 'r')
          draw(y[1], 'g')
          draw(y[2], 'b')
          plt.savefig('predict_LSTM%d.pdf'%i)
          #plt.close()

Upvotes: 0

Views: 1841

Answers (1)

Shihao Xu

Reputation: 771

I've just executed your code and the original code. I think the problem is that you didn't train with Adam long enough: your training loss is still decreasing at step 15. So I changed the number of steps from 15 to 45, and this is the figure generated after step 40:

[figure: predicted sine waves after step 40]

The original LBFGS code reached a loss of 4e-05 after step 4, but after that the loss somehow exploded. Your Adam version keeps reducing the loss across all 45 steps, but the final loss is around 0.001. I hope I ran both programs correctly.
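Concretely, the only change I made was the step count; a minimal sketch of the modified training loop (seq, criterion, optimizer, input, target are exactly as in your script):

# train for 45 Adam steps instead of 15
for i in range(45):
    print('STEP: ', i)
    seq.train()
    optimizer.zero_grad()
    out = seq(input)               # full sequence-to-sequence prediction
    loss = criterion(out, target)
    print('train loss:', loss.item())
    loss.backward()
    optimizer.step()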

Oh, regarding your second question:

also wonder if the code segment regarding sequence-in-sequence-out

Yes, you can write a function or define a module with two LSTMs to do that. But it doesn't buy you much here, since your network contains only two LSTM cells; you have to do this "wiring" work somewhere anyway.

If your network contains several such blocks, you can write a module with two LSTMs and use it as a primitive module, e.g. self.BigLSTM = BigLSTM(...), just like you define self.lstm1 = nn.LSTMCell(...). A minimal sketch of what that could look like is below.
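This is just a sketch under my assumptions (the class name BigLSTM, the nested state layout, and the sizes are placeholders I picked, not something from your code); it reuses the nn import from your script:

class BigLSTM(nn.Module):
    # Wires two LSTMCells together for a single time step.
    def __init__(self, input_size, hidden_size):
        super(BigLSTM, self).__init__()
        self.lstm1 = nn.LSTMCell(input_size, hidden_size)
        self.lstm2 = nn.LSTMCell(hidden_size, hidden_size)

    def forward(self, x_t, state):
        # state holds the (h, c) pair for each of the two cells
        (h_t, c_t), (h_t2, c_t2) = state
        h_t, c_t = self.lstm1(x_t, (h_t, c_t))
        h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
        return h_t2, ((h_t, c_t), (h_t2, c_t2))

In Sequence.__init__ you would then write self.BigLSTM = BigLSTM(1, 51), build the initial ((h_t, c_t), (h_t2, c_t2)) state in forward, and call output, state = self.BigLSTM(input_t, state) inside the time-step loop, keeping self.linear on top of the returned output.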

Upvotes: 1
