Reputation: 414
here is my code
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def create_dataset(signal_data, look_back=1):
    dataX, dataY = [], []
    for i in range(len(signal_data) - look_back):
        dataX.append(signal_data[i:(i + look_back), 0])
        dataY.append(signal_data[i + look_back, 0])
    return np.array(dataX), np.array(dataY)

look_back = 1  # was undefined in the snippet; 1 matches the function default

df = pd.read_csv('time_series.csv')
signal_data = df.Close.values.astype('float32')
signal_data = signal_data.reshape(len(df), 1)

scaler = MinMaxScaler(feature_range=(0, 1))
signal_data = scaler.fit_transform(signal_data)

train_size = int(len(signal_data) * 0.80)
test_size = len(signal_data) - train_size
# val_size = len(signal_data) - train_size - test_size

train = signal_data[0:train_size]
# val = signal_data[train_size:train_size+val_size]
test = signal_data[train_size:len(signal_data)]

x_train, y_train = create_dataset(train, look_back)
# x_val, y_val = create_dataset(val, look_back)
x_test, y_test = create_dataset(test, look_back)

x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
# x_val = np.reshape(x_val, (x_val.shape[0], x_val.shape[1], 1))
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
Now I want to add df.Open, df.High, df.Low, and df.Volume as input features as well. How can I change this code to do that?
Should I just append them to signal_data? I'm not sure where or how to restructure the code so that the model trains on multiple features.
Your valuable opinions and thoughts will be very much appreciated.
Upvotes: 0
Views: 132
Reputation: 2222
I made several modifications to your code. This should work. In summary, these are my general recommendations:
- Avoid MinMaxScaler: it is dangerous because a single outlier can distort your whole distribution. Instead, use StandardScaler. More info here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- Scale the data only after train_x, test_x and their respective y counterparts are built (see the short sketch after these notes). Otherwise you are computing the scaling statistics on the train and test set together, i.e. on future information. That is very different from what you will face in a real situation, where new data has to be scaled with past statistics, so it is better to build a test set that is as close to reality as possible.
- As metrics you chose accuracy, which is a classification metric. I would use one that fits the type of problem (regression), for example mean absolute error.
I hope I managed to help you :D
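To make the first two recommendations concrete, here is a minimal sketch (my own illustration, not part of the modified code below, assuming signal_data is the same 5-column array built below): the scaler is fit on the training rows only and then reused for the test rows, and the compile call tracks a regression metric.

from sklearn.preprocessing import StandardScaler

# Split first, then fit the scaler on the training rows only,
# so no statistics from the "future" leak into the transform.
train_size = int(len(signal_data) * 0.80)
train_raw = signal_data[:train_size]
test_raw = signal_data[train_size:]

scaler = StandardScaler()
train = scaler.fit_transform(train_raw)  # fit + transform on the training split
test = scaler.transform(test_raw)        # transform only, reusing the training statistics

# For a regression target, track a regression metric instead of accuracy, e.g.:
# model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])

The full modified code: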
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf
import matplotlib.pyplot as plt
# TensorFlow 1.x session setup: let GPU memory grow as needed
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
def create_dataset(signal_data, look_back=1):
    # X: all feature columns over a window of `look_back` steps
    # y: the next step's last column (Close, in the column order used below)
    dataX, dataY = [], []
    for i in range(len(signal_data) - look_back):
        dataX.append(signal_data[i:(i + look_back), :])
        dataY.append(signal_data[i + look_back, -1])
    return np.array(dataX), np.array(dataY)
look_back = 20
df = pd.read_csv('kospi.csv')
signal_data = df[["Open", "Low", "High", "Volume", "Close"]].values.astype('float32')
scaler = MinMaxScaler(feature_range=(0, 1))
signal_data = scaler.fit_transform(signal_data)
train_size = int(len(signal_data) * 0.80)
test_size = len(signal_data) - train_size - int(len(signal_data) * 0.05)
val_size = len(signal_data) - train_size - test_size
train = signal_data[0:train_size]
val = signal_data[train_size:train_size+val_size]
test = signal_data[train_size+val_size:len(signal_data)]
x_train, y_train = create_dataset(train, look_back)
x_val, y_val = create_dataset(val, look_back)
x_test, y_test = create_dataset(test, look_back)
model = Sequential()
model.add(LSTM(128, input_shape=(None, 5),return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(128, input_shape=(None, 5)))
model.add(Dropout(0.3))
model.add(Dense(128))
model.add(Dropout(0.3))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.summary()
hist = model.fit(x_train, y_train, epochs=20, batch_size=32, verbose=2, validation_data=(x_val, y_val))
trainScore = model.evaluate(x_train, y_train, verbose=0)
model.reset_states()
print('Train Score: ', trainScore)
valScore = model.evaluate(x_val, y_val, verbose=0)
model.reset_states()
print('Validation Score: ', valScore)
testScore = model.evaluate(x_test, y_test, verbose=0)
model.reset_states()
print('Test Score: ', testScore)
p = model.predict(x_test)
print(mean_squared_error(y_test, p))
plt.plot(y_test)
plt.plot(p)
plt.legend(['testY', 'p'], loc='upper right')
plt.show()
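Note that y_test and p are still in the scaled 0 to 1 range, because the scaler was fit on all five columns. If you want the plot in original price units, one possible approach (a sketch of my own, with invert_close as a made-up helper name, assuming Close is the last of the five scaled columns as above) is to pad the values back to five columns, call scaler.inverse_transform, and keep only the Close column:

def invert_close(scaler, values, n_features=5):
    # Place the 1-D Close values in the last column of a zero-padded
    # (n_samples, n_features) array so inverse_transform accepts it,
    # then return only the rescaled Close column.
    padded = np.zeros((len(values), n_features), dtype='float32')
    padded[:, -1] = np.ravel(values)
    return scaler.inverse_transform(padded)[:, -1]

plt.plot(invert_close(scaler, y_test))
plt.plot(invert_close(scaler, p))
plt.legend(['testY', 'p'], loc='upper right')
plt.show()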
Upvotes: 1