dev_user

Reputation: 417

How to build an LSTM time-series forecasting model in Python?

I'm trying to build an LSTM model. The data consists of a date_time column and some numeric values. While fitting the model, I get this error:

"ValueError: Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (10, 1)"

Sample data: "date.csv" looks like:

Date
06/13/2018 07:20:04 PM
06/13/2018 07:20:04 PM
06/13/2018 07:20:04 PM
06/13/2018 07:22:12 PM
06/13/2018 07:22:12 PM
06/13/2018 07:22:12 PM
06/13/2018 07:26:20 PM
06/13/2018 07:26:20 PM
06/13/2018 07:26:20 PM
06/13/2018 07:26:20 PM

"tasks.csv" looks like:

Tasks
2
1
2
1
4
2
3
2
3
4

    date = pd.read_csv('date.csv')
    task = pd.read_csv('tasks.csv')
    model = Sequential()
    model.add(LSTM(24,return_sequences=True,input_shape=(date.shape[0],1)))
    model.add(Dense(1))
    model.compile(loss="mean_squared_error", optimizer="adam")
    model.fit(date, task,  epochs=100,  batch_size=1,  verbose=1)

How can I forecast the result?

Upvotes: 1

Views: 2426

Answers (1)

Mikhail Stepanov

Reputation: 3790

There are several issues with this code sample: it lacks preprocessing, label encoding and target encoding, and it uses an unsuitable loss function. I briefly describe possible solutions below, but for more information and examples you can read a tutorial about time series and forecasting.

The specific problem that raises this ValueError is that an LSTM requires three-dimensional input of shape (batch_size, input_length, dimension). So it needs an input of shape at least (batch_size, 1, 1), but date.shape is (10, 1). If you do

date = date.values.reshape((1, 10, 1)) 

- this solves that one problem, but brings an avalanche of other problems:

date = date.values.reshape((1, 10, 1))

model = Sequential()
model.add(LSTM(24, return_sequences=True, input_shape=(date.shape[1], 1)))
print(model.layers[-1].output_shape)
model.add(Dense(1))
model.compile(loss="mean_squared_error", optimizer="adam")
model.fit(date, task,  epochs=100,  batch_size=1,  verbose=1)

ValueError: Input arrays should have the same number of samples as target arrays. Found 1 input samples and 10 target samples.
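
The sample-count mismatch can be reproduced with NumPy alone, without building a model. This is only a sketch: the array of zeros is a placeholder standing in for the real data (an assumption for illustration), and the point is that Keras treats the first axis as the number of samples.

```python
import numpy as np

# Placeholder for the 10 rows read from the CSV (values are not meaningful).
date = np.zeros((10, 1))
task = np.array([2, 1, 2, 1, 4, 2, 3, 2, 3, 4])

one_long_sequence = date.reshape((1, 10, 1))    # 1 sample, 10 time steps, 1 feature
ten_short_sequences = date.reshape((10, 1, 1))  # 10 samples, 1 time step, 1 feature

print(one_long_sequence.shape[0], len(task))    # sample counts differ
print(ten_short_sequences.shape[0], len(task))  # sample counts match
```

The second layout matches the number of targets, but each sample then contains a single time step, so the LSTM has no history to learn from.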

Unfortunately, I can't answer the other questions because of a lack of information, but here are some general-purpose recommendations.

Preprocessing
Unfortunately, you probably can't just reshape, because forecasting is a somewhat more complicated thing. You should choose some period of past measurements on which you will base the forecast of the next task. The good news is that the measurements are periodic; however, there are several tasks for each timestamp, which makes the problem harder to solve.
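
One common way to pick such a period is a sliding window over past values. The following is a minimal sketch, assuming the previous `window - 1` tasks are the input and the next task is the target; the window size of 4 is an arbitrary example, and the values are the task ids from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

tasks = np.array([2, 1, 2, 1, 4, 2, 3, 2, 3, 4])
window = 4  # 3 past values as input + 1 value as target (example choice)

# Every run of `window` consecutive values becomes one row.
windows = sliding_window_view(tasks, window)

X = windows[:, :-1]  # the preceding tasks
y = windows[:, -1]   # the task to predict

print(X.shape, y.shape)
print(X[0], y[0])
```

Each row of `X` is one training sample and the matching entry of `y` is its target, so the number of samples and targets agree by construction.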

Features
You need features in order to predict something. It's not clear what the features are in this case, but probably not the date and time. Even the previous tasks could serve as features, but you can't use the raw task id: it needs some embedding, because it is a label rather than a continuous numeric value.

Embedding
Keras provides keras.layers.Embedding for embedding such values.

If the number of tasks is 4 (1, 2, 3, 4) and you have chosen the size of the output vector, you could use it this way:

model = Sequential()
model.add(Embedding(4 + 1, 10, input_length=10))  # + 1 because the task ids start at 1, not 0
# ... the rest of the code is omitted

- the first argument is the number of items to embed, the second is the size of the output vector, and the last is the input length (10 is just an example value).
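
Conceptually, an embedding is a lookup into a table of trainable vectors. The NumPy sketch below only illustrates that idea and the resulting shapes; it is not how Keras implements the layer, and the random table and batch are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, output_dim, input_length = 4 + 1, 10, 10

# Hypothetical weight table: one vector of size output_dim per item id.
# Row 0 is never looked up here because the task ids start at 1.
table = rng.normal(size=(n_items, output_dim))

# A made-up batch of 2 sequences of task ids, each of length input_length.
batch = rng.integers(1, 5, size=(2, input_length))

embedded = table[batch]  # replace every id with its vector
print(embedded.shape)    # (batch, input_length, output_dim)
```

The result is three-dimensional, which is why an embedding layer can feed an LSTM directly.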

Label encoding
The task labels are probably just labels, with no reasonable distance or metric between them, i.e. you can't say that 1 is closer to 2 than to 4, etc. In that case MSE is useless, but fortunately there is a probabilistic loss function called categorical cross-entropy that helps to predict the category of the data.

To use it, you should binarize the labels:

import numpy as np

def binarize(labels):
    label_map = dict(map(reversed, enumerate(np.unique(labels))))
    bin_labels = np.zeros((len(labels), len(label_map)))
    bin_labels[np.arange(len(labels)), [label_map[label] for label in labels]]  = 1
    return bin_labels, label_map

binarized_task, label_map = binarize(task['Tasks'].values)  # pass the 1-D array of labels, not the DataFrame
binarized_task
Out:
array([[0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 0., 1.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.]])
label_map
Out:
{1: 0, 2: 1, 3: 2, 4: 3}

- the binarized labels and a map from each task id to its position in the binarized labels.
Of course, you should use a cross-entropy loss in a model with binarized labels. Also, the last layer should use the softmax activation function (explained in the tutorial about cross-entropy; in short, you are dealing with the probabilities of labels, so they should sum to one, and softmax transforms the previous layer's values to meet this requirement):

model.add(Dense(4, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(date, binarized_task, epochs=100, batch_size=1,  verbose=1)
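
To make the softmax and cross-entropy step concrete, here is a small NumPy sketch of both computations. The helper functions and the input numbers are illustrative assumptions, not the Keras implementations.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability, then normalize.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Negative log-probability assigned to the true class of each sample.
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.sum(y_true * np.log(y_prob), axis=-1)

logits = np.array([[2.0, 0.5, 0.1, -1.0]])  # made-up outputs of the last layer
y_true = np.array([[1.0, 0.0, 0.0, 0.0]])   # binarized label: first class

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)

print(probs.sum(axis=-1))  # each row of probabilities sums to one
print(loss)
```

The loss shrinks as the probability assigned to the true class approaches one, which is what training tries to achieve.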

"Complete", but probably meaningless, example
This example uses everything listed above. It doesn't claim to be complete or useful, but I hope it is at least explanatory.

import datetime
import numpy as np
import pandas as pd
import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, Flatten, Embedding

# Define functions

def binarize(labels):
    """
    Labels of shape (size,) to {0, 1} array of the shape (size, n_labels)
    """
    label_map = dict(map(reversed, enumerate(np.unique(labels))))
    bin_labels = np.zeros((len(labels), len(label_map)))
    bin_labels[np.arange(len(labels)), [label_map[label] for label in labels]]  = 1
    return bin_labels, label_map


def group_chunks(df, chunk_size):
    """
    Group task data by periods: the first columns are used as features and the last one ('Tasks') as the target. The function uses previous 'Tasks' values as features.
    """
    chunks = []
    for i in range(0, len(df)-chunk_size):
        chunks.append(df.iloc[i:i + chunk_size]['Tasks'])  # slice period, append 
        chunks[-1].index = list(range(chunk_size))
    df_out = pd.concat(chunks, axis=1).T  
    df_out.index = df['Date'].iloc[:(len(df) - chunk_size)]
    df_out.columns = [i for i in df_out.columns[:-1]] + ['Tasks']
    return df_out


# I modified the data for simplicity - now there is a single entry for each datetime
date = pd.DataFrame({
    "Date" : [
        "06/13/2018 07:20:00 PM",
        "06/13/2018 07:20:01 PM",
        "06/13/2018 07:20:02 PM",
        "06/13/2018 07:20:03 PM",
        "06/13/2018 07:20:04 PM",
        "06/13/2018 07:20:05 PM",
        "06/13/2018 07:20:06 PM",
        "06/13/2018 07:20:07 PM",
        "06/13/2018 07:20:08 PM",
        "06/13/2018 07:20:09 PM"]
})

task = pd.DataFrame({"Tasks": [2, 1, 2, 1, 4, 2, 3, 2, 3, 4]})
date['Tasks'] = task['Tasks']
date['Date'] = date['Date'].map(lambda x: datetime.datetime.strptime(x, "%m/%d/%Y %I:%M:%S %p"))  # formatting datetime as datetime


chunk_size = 4
df = group_chunks(date, chunk_size)
# print(df)
"""
                     0  1  2  Tasks
Date                               
2018-06-13 19:20:00  2  1  2      1
2018-06-13 19:20:01  1  2  1      4
2018-06-13 19:20:02  2  1  4      2
2018-06-13 19:20:03  1  4  2      3
2018-06-13 19:20:04  4  2  3      2
2018-06-13 19:20:05  2  3  2      3

"""
# extract the train data and target
X = df[list(range(chunk_size-1))].values
y, label_map = binarize(df['Tasks'].values)

# Create a model, compile, fit
model = Sequential()
model.add(Embedding(len(np.unique(X))+1, 24, input_length=X.shape[-1]))
model.add(LSTM(24, return_sequences=True))  # input shape is inferred from the Embedding output
model.add(Flatten())
model.add(Dense(4, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer="adam")
history = model.fit(X, y,  epochs=100,  batch_size=1,  verbose=1)
Out:
Epoch 1/100
6/6 [==============================] - 1s 168ms/step - loss: 1.3885
Epoch 2/100
6/6 [==============================] - 0s 5ms/step - loss: 1.3811
Epoch 3/100
6/6 [==============================] - 0s 5ms/step - loss: 1.3781
...

- etc. It works somehow, but I kindly advise once more: read the tutorial linked above (or any other forecasting tutorial), because, for example, I haven't covered testing and validation in this example.
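
As for turning the network's output into a forecast: the softmax layer returns one probability per class, so the predicted task is the class with the highest probability, mapped back through label_map. The sketch below assumes a probability array of the kind model.predict would return; the numbers are made up, not real model output.

```python
import numpy as np

label_map = {1: 0, 2: 1, 3: 2, 4: 3}  # task id -> column, as built by binarize
inverse_map = {column: task_id for task_id, column in label_map.items()}

# Made-up probabilities standing in for model.predict(X_new), shape (n_samples, 4).
probs = np.array([[0.10, 0.60, 0.20, 0.10],
                  [0.05, 0.15, 0.10, 0.70]])

# Pick the most probable column for each sample and map it back to a task id.
predicted_tasks = [inverse_map[int(column)] for column in probs.argmax(axis=-1)]
print(predicted_tasks)
```

To judge whether such forecasts are any good, hold out the most recent part of the series for validation rather than a random subset, so the model is always evaluated on data that comes after what it was trained on.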

Upvotes: 2
