MTFS20

Reputation: 21

LSTM time series prediction - val and test loss lower than train loss

I am trying to predict the next vehicle speed based on the speeds from previous steps. My current approach is to use an LSTM neural network for time series forecasting. I have read a lot of tutorials on this problem and have now built my own program for forecasting vehicle speeds. In my current setup, I try to predict the next vehicle speed from the previous 20 speeds.

Data: I have a dataset of ~1000 different .csv files. Each .csv file contains the vehicle speed of a real car, measured every second. The routes are different, but they are in the same area and from the same driver (me). Therefore, each .csv file has a different length.

A typical drive cycle from my dataset

Data retrieval and splitting:

I get the filenames of my .csv files and split them into a train, validation and test set. I do this split at the file level, as early as possible, to prevent leakage.

full_files = [f for f in os.listdir(search_path) if os.path.isfile(f) and f.endswith(".csv")]
random.shuffle(full_files)
args = [cells, files_batched, None, history_range, future_range, steps_skipped, shifting]

files_len = len(full_files)

train_count = round(files_len * 0.7)
val_count = round(files_len * 0.2)
test_count = round(files_len * 0.1)

train_files = full_files[:train_count]
val_files = full_files[train_count:train_count + val_count]
test_files = full_files[train_count + val_count:train_count + val_count + test_count]

train_data = tuple(map(np.array, zip(*list(data_generator(train_files, *args)))))
val_data = tuple(map(np.array, zip(*list(data_generator(val_files, *args)))))
test_data = tuple(map(np.array, zip(*list(data_generator(test_files, *args)))))

I wrote a generator, "data_generator", which basically goes through each file, reads it, and directly splits it into my desired LSTM input shape. future_range is the number of time steps I use as the label, history_range is the number of previous time steps, and cells are the feature columns I want to extract from each .csv file. I know this might not be the best approach, but it ensures that I have no window overlaps between different .csv files (which would happen if I read everything in and split it afterwards).

def multivariate_data(data_set, target_vector, start_index, end_index, history_size, target_size, skip_data=1, index_step=1):
    data = []
    labels = []

    start_index = start_index + history_size
    if end_index is None:
        end_index = len(data_set) - target_size

    for i in range(start_index, end_index, index_step):
        # take the history_size past steps (subsampled by skip_data) as one input window
        indices = range(i - history_size, i, skip_data)

        data.append(data_set[indices])
        # the next target_size values of the target column form the label
        labels.append(target_vector[i:i + target_size])

    return np.array(data), np.array(labels)


def data_generator(file_list, feature_cells, file_batches, sample_batches, history_step, prediction_step, skip_data=1, shift=True):
    i = 0
    while True:
        if i >= len(file_list):
            break
        else:
            printProgressBar(i, len(file_list) - 1)
            file_chunk = file_list[i * file_batches:(i + 1) * file_batches]
            for file in file_chunk:
                temp = pd.read_csv(open(file, 'r'), usecols=feature_cells, sep=";", header=None)
                norm_values = temp.values

                # overlapping windows by default; with shifting disabled, advance by a
                # full window so consecutive samples do not overlap
                index_step = 1
                if shift is False:
                    index_step = history_step + 1
                # the first feature column is used as the prediction target (the speed)
                train, label = multivariate_data(norm_values, norm_values[:, 0], 0, None, history_step, prediction_step, skip_data, index_step)

                if sample_batches is not None:
                    for index in range(0, len(train), sample_batches):
                        batch = train[index: index + sample_batches], label[index: index + sample_batches]
                        if batch[0].shape != (sample_batches, history_step, len(feature_cells)):
                            continue
                        yield batch
                else:
                    for index in range(0, len(train)):
                        yield train[index], label[index]
        i += 1
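
Just to illustrate the resulting shapes, here is a toy check of multivariate_data on a single-feature series of 25 samples (not part of the actual pipeline):

toy = np.arange(25, dtype=float).reshape(-1, 1)                    # 25 "speed" samples, 1 feature
x, y = multivariate_data(toy, toy[:, 0], 0, None, history_size=20, target_size=1)
print(x.shape, y.shape)                                            # (4, 20, 1) and (4, 1)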

Scaling and shuffle:

Now I fit my MinMaxScaler to the training set only (to prevent leakage) and apply the transformation to the train, validation and test sets. Then I create tensor slices and shuffle the data.

scaler = MinMaxScaler(feature_range=scaling)
scaler.fit(train_data[0].reshape(-1, train_data[0].shape[-1]))

train_x = scaler.transform(train_data[0].reshape(-1, train_data[0].shape[-1])).reshape(train_data[0].shape)
train_y = scaler.transform(train_data[1].reshape(-1, train_data[1].shape[-1])).reshape(train_data[1].shape)

val_x = scaler.transform(val_data[0].reshape(-1, val_data[0].shape[-1])).reshape(val_data[0].shape)
val_y = scaler.transform(val_data[1].reshape(-1, val_data[1].shape[-1])).reshape(val_data[1].shape)

test_x = scaler.transform(test_data[0].reshape(-1, test_data[0].shape[-1])).reshape(test_data[0].shape)
test_y = scaler.transform(test_data[1].reshape(-1, test_data[1].shape[-1])).reshape(test_data[1].shape)

train_len = len(train_x)
val_len = len(val_x)
test_len = len(test_x)

train_set = tf.data.Dataset.from_tensor_slices((train_x, train_y))
train_set = train_set.cache().shuffle(train_len).batch(batch_size).repeat()

val_set = tf.data.Dataset.from_tensor_slices((val_x, val_y))
val_set = val_set.batch(batch_size).repeat()

test_set = tf.data.Dataset.from_tensor_slices((test_x, test_y))
test_set = test_set.batch(batch_size).repeat()
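
As a quick sanity check of what is actually fed to the network, the element spec of the datasets can be printed (inspection only):

print(train_set.element_spec)   # should show (None, history, n_features) inputs and (None, future_range) labels
print(val_set.element_spec)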

Training:

At the end, I create my model and fit it to my data.

train_steps = train_len // batch_size
val_steps = val_len // batch_size
test_steps = test_len // batch_size

model = Sequential()

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(dropout))

model.add(LSTM(64))
model.add(Dropout(dropout))

model.add(Dense(future_range))

early_stopping = EarlyStopping(monitor='val_loss', patience=2, mode='min')
checkpoint = ModelCheckpoint(search_path + "\\model", monitor='loss', verbose=0, save_best_only=True, mode='min')
optimizer = tf.optimizers.Adam(learning_rate=learning_rate)

model.compile(loss=tf.losses.MeanSquaredError(), optimizer=optimizer, metrics=[root_mean_squared_error])
history = model.fit(train_set, validation_data=val_set, epochs=epochs, steps_per_epoch=train_steps, validation_steps=val_steps, callbacks=[early_stopping])

When I start training, the loss starts at ~0.2 and drops below 0.05 after the first epoch already. The validation loss is always lower than the training loss, and the loss on the test set is also extremely low.

Validation and training loss

I think it is obvious that these results are not "sane" for such a NN, since vehicle speed is normally quite a complex function. I have already searched the internet for possible mistakes, and the only one that seems plausible to me is data leakage. But in my opinion there is no way that information leaks from the training set to the validation set: I split the files directly and fit the scaler only on the training data.
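
A quick sanity check that the three file lists are really disjoint (this only rules out a direct file overlap, nothing else):

assert not set(train_files) & set(val_files)
assert not set(train_files) & set(test_files)
assert not set(val_files) & set(test_files)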

I have also checked this article, but I do not think that any of those problems applies here: https://www.kdnuggets.com/2017/08/37-reasons-neural-network-not-working.html

I'm sorry if I made a stupid mistake, but I'm new to deep learning and don't know my way around very well yet. What could be wrong here?

EDIT: I tried to change the model from single-step to multi-step prediction. The shape of the "future" prediction is always the same (always linear and not following the correct shape). The MSE loss on the test set (also used for the prediction below) is low, but the RMSE is very high. How can this be?

predict 20 steps test set loss

Upvotes: 2

Views: 2114

Answers (1)

Grayrigel

Reputation: 3594

This is more of a discussion question. Do you use heavy dropout? That alone might explain what is going on here. Try experiments with different dropout values. You can also play with other regularization techniques.

Taking dropout as the example: because neurons are disabled, some of the information about each sample is lost, and the subsequent layers have to construct the answer based on incomplete representations. The training loss is higher because you have made it artificially harder for the network to give the right answers. During validation, however, all of the units are available, so the network has its full computational power, and thus it might perform better than in training.
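
A quick way to see how much of the gap dropout alone accounts for is to evaluate the same batch once with dropout active and once with it disabled (just a sketch, reusing the model and train_set from the question):

import numpy as np
import tensorflow as tf

x_batch, y_batch = next(iter(train_set))           # one batch from the training pipeline
mse = tf.keras.losses.MeanSquaredError()

# training=True keeps dropout active; average over a few random dropout masks
loss_dropout_on = np.mean([mse(y_batch, model(x_batch, training=True)).numpy() for _ in range(10)])
loss_dropout_off = mse(y_batch, model(x_batch, training=False)).numpy()
print(loss_dropout_on, loss_dropout_off)

If the first number is clearly larger than the second, a good part of the train/validation gap is simply an artifact of how the training loss is measured.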

I will summarize the possible reasons:

  • Regularization is applied during training, but not during validation/testing.

  • Training loss is measured during each epoch while validation loss is measured after each epoch.

  • The validation set may be easier than the training set (or there may be leaks). Try cross-validation if possible; a file-level sketch follows below.
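
A minimal sketch of such a cross-validation at the file level, where build_and_evaluate is a hypothetical helper that builds the datasets from the two file lists, trains a fresh model and returns the validation loss:

from sklearn.model_selection import KFold
import numpy as np

files = np.array(full_files)
fold_losses = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(files):
    # split by file so windows from one drive never end up in both folds
    fold_losses.append(build_and_evaluate(files[train_idx], files[val_idx]))  # hypothetical helper
print(np.mean(fold_losses), np.std(fold_losses))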

You can find more details on the topic here.

Upvotes: 1
