mrbTT

Reputation: 1409

How to shape large DataFrame for python's keras LSTM?

I nearly found what I need in the accepted answer here, but I ran into memory issues: the test df provided there was only 11 rows, and mine is much larger.

What I'm trying to do is use an LSTM to forecast 10 days ahead on time series data, as a regression model (not a classifier!). My dataframe X has around 1500 rows and 2000 features, so its shape is (1500, 2000), while the truth values y are just 1500 rows of 1 feature (which can take any value between -1 and 1).

Since an LSTM needs a 3D array as input, I'm really struggling with how to reshape the data.

Again, following the example linked in the first paragraph, it crashes with a MemoryError when padding values, more specifically at df.cumulative_input_vectors.tolist().

My test set (read: the forecast input) is a dataframe of shape (10, 2000).

Due to sensitive data I can't actually share the values/example. How can I help you help me with this?

So, to enable the LSTM to learn from the 1500 rows of y, how should I reshape my X of 1500 rows and 2000 features? Also, how should I reshape my forecast of 10 rows and 2000 features?

At first, because I'm still learning LSTM, they'll go through a simple LSTM model:

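from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# input_shape is (timesteps, features), i.e. the last two axes of train_X
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(train_X, train_y, epochs=50, batch_size=2, verbose=1)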

What I've tried, but I got an error when predicting:

import numpy as np

# A function to build the 3D data, as I understood what was needed:
def preprocess_data(stock, seq_len):
    amount_of_features = len(stock.columns)
    data = stock.values

    sequence_length = seq_len #+ 1
    result = []
    # slide a window of sequence_length rows over the data
    for index in range(len(data) - sequence_length):
        result.append(data[index : index + sequence_length])

    X_train = np.array(result)

    # already 3D at this point; the reshape only makes the expected shape explicit
    X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], amount_of_features))

    return X_train

# creating the train as:
# X == the DF of 1500 rows and 2000 features
window = 10
train_X = preprocess_data(X[::-1], window)

Upvotes: 3

Views: 2619

Answers (1)

mrbTT

Reputation: 1409

After a while I managed to properly understand what the dimensions were. Keras expects a 3D array of .shape (totalRows, sequences, totalColumns), usually described as (samples, timesteps, features). The sequences one was the most confusing to me.
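To make the three axes concrete, here is a minimal NumPy sketch. The sizes (8 rows, 3 features, windows of 2 rows) are made up for illustration and are not taken from my data.

```python
import numpy as np

# Toy data: 8 rows (days) and 3 features; sizes are illustrative only.
total_rows, total_columns = 8, 3
data = np.arange(total_rows * total_columns, dtype="float32").reshape(total_rows, total_columns)

# One sample per row, each with a single time step: (totalRows, 1, totalColumns).
one_step = data.reshape(total_rows, 1, total_columns)

# Windows of 2 consecutive rows: (number_of_windows, 2, totalColumns).
sequences = 2
windows = np.stack([data[i:i + sequences] for i in range(total_rows - sequences + 1)])

print(one_step.shape)  # (8, 1, 3)
print(windows.shape)   # (7, 2, 3)
```

The middle axis is how many consecutive rows the LSTM sees per sample; the first axis shrinks accordingly because the last rows cannot start a full window.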

That was because I was reshaping the df with df.values.reshape(len(df), 1, len(df.columns)), meaning keras would learn from a matrix of 1 line per sample. It also gave me bad results because I didn't know it's best to scale the data first. For me MinMaxScaler(feature_range=(-1, 1)) worked best, but (0, 1) could work too.
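As a rough sketch of the scaling step, the snippet below writes the min-max formula out by hand in NumPy. The helper names `fit_min_max` and `apply_min_max` are hypothetical, and the random arrays only stand in for the real features. In practice scikit-learn's MinMaxScaler(feature_range=(-1, 1)) does the same job with fit on the training rows and transform on the forecast rows.

```python
import numpy as np

def fit_min_max(train, feature_range=(-1.0, 1.0)):
    """Learn per-column minimum and span from the training rows only."""
    col_min = train.min(axis=0)
    col_span = train.max(axis=0) - col_min
    col_span[col_span == 0] = 1.0  # constant columns: avoid dividing by zero
    return col_min, col_span, feature_range

def apply_min_max(values, params):
    """Map values into feature_range using the parameters learned on train."""
    col_min, col_span, (low, high) = params
    return (values - col_min) / col_span * (high - low) + low

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 4))     # stand-in for the training features
forecast = rng.normal(size=(3, 4))   # stand-in for the forecast rows

params = fit_min_max(train)
train_scaled = apply_min_max(train, params)
forecast_scaled = apply_min_max(forecast, params)  # reuse the training min/max

# Every training column now spans exactly -1 to 1.
print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

The forecast rows reuse the parameters learned on the training rows, so their scaled values can fall slightly outside the range.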

What made me understand it was to use sequences of more than 1 row (or days, since my dataset is a time series). Instead of feeding 1 row of features X to get 1 value of y, I used something like 5 rows of features X to get 1 value of y, as in:

import numpy as np

# after scaling the df, resulted in "scaled_dataset"
sequences = 5
result = []
# walk over every start row that still leaves a full window of `sequences` rows,
# so every group has the same length
for i in range(len(scaled_dataset) - sequences + 1):
    # this will add into the list as [[R1a,R1b...R1t],[R2a,R2b...R2t],...[R5a,R5b...R5t]]
    result.append(scaled_dataset[i:i+sequences].values)
# Converting to array; keras takes float32 better than float64
train_x = np.array(result).astype('float32')
# making the y the same length as X (keeps the last train_x.shape[0] values)
train_y = np.array(y.tail(train_x.shape[0]).values)

train_x.shape, train_y.shape

>>> (1495, 5, 2400), (1495,)

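Written another way, this is how to think about keras shapes for my problem:

Considering it's a time series, the above means that 5 days of data (rows 0 to 4) result in one value of y, the one for row 5.

Then, dropping the first day and adding the next day after the last (still 5 days, rows 1 to 5), the data results in the value of y for row 6.

Then, dropping the second day and adding the next day after the last (still 5 days, rows 2 to 6), the data results in the value of y for row 7.

Note that y.tail(...) as written pairs each window with the y of its own last row (rows 0 to 4 with the y of row 4). The description above therefore holds only if y already contains the next-day target; otherwise shift y by one row first.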
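The pairing between windows and targets is easy to get off by one, so here is a small sketch that makes it visible. The helper `make_windows` and its `horizon` parameter are hypothetical names for illustration, and the target is just the row index so you can read off which row each window is paired with.

```python
import numpy as np

def make_windows(features, targets, sequences, horizon=0):
    """Pair each window of `sequences` rows with the target `horizon` rows after its last row."""
    xs, ys = [], []
    last_start = len(features) - sequences - horizon
    for start in range(last_start + 1):
        end = start + sequences              # exclusive end of the window
        xs.append(features[start:end])
        ys.append(targets[end - 1 + horizon])
    return np.asarray(xs, dtype="float32"), np.asarray(ys, dtype="float32")

# Toy data: 12 rows, 3 features; the target equals the row index.
features = np.random.default_rng(0).normal(size=(12, 3))
targets = np.arange(12, dtype="float32")

same_day_x, same_day_y = make_windows(features, targets, sequences=5, horizon=0)
next_day_x, next_day_y = make_windows(features, targets, sequences=5, horizon=1)

print(same_day_x.shape, same_day_y[:3])  # (8, 5, 3) [4. 5. 6.]
print(next_day_x.shape, next_day_y[:3])  # (7, 5, 3) [5. 6. 7.]

# Input for predicting past the end of the data: the most recent 5 rows as one sample.
latest = features[-5:].astype("float32")[np.newaxis, ...]
print(latest.shape)  # (1, 5, 3)
```

With horizon=0 the pairing matches what y.tail(train_x.shape[0]) produces, and with horizon=1 rows 0 to 4 map to the y of row 5. Under the same assumptions, a larger horizon (for example 10) would pair each window with the target that many rows later, which is one possible way to set up a 10-days-ahead target.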

It's quite confusing for people starting out with keras/LSTM, but I hope this detail helps anyone who lands here.

Upvotes: 2
