Jenő Fekete

Reputation: 61

How to prepare irregularly spaced time-series data for classification using LSTM?

I have a variable holding 215 days' worth of data, structured like this: processed_data is a 215×1 cell array in which each cell contains the data for one day. The number of observations per day varies (about 12,000 rows on average). Each row is one observation: the first column contains the seconds elapsed since the previous row (not normalized), the second column contains the price of a specified security (normalized using z-score), and the third column is the target variable, which signals whether the price will be 0.01% higher 60 seconds later (1) or not (0). I'm using the first two columns as the predictors. I keep the days separate because hours pass between the last row of day processed_data{i, 1} and the first row of day processed_data{i+1, 1}. Below is a sample of data from an arbitrary day:

2.57500000000437    0.502515050312692   0
1.03600000000006    0.469361050915526   1
1.05899999999383    0.386501335237771   1
0.838000000003376   0.436219680495852   0
1.12999999999738    0.469361050915526   0
0.824000000000524   0.369924327252462   1
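The layout described above can be reproduced with synthetic data, which is handy for testing formatting code without the real prices. This is only a sketch: the number of days, the row counts, and the random values are made up, and only the three-column structure follows the description.

```matlab
% Synthetic stand-in for processed_data; sizes and values are made up.
rng(0);                                   % reproducible random numbers
num_days = 3;
processed_data = cell(num_days, 1);
for d = 1:num_days
    n = 1000 + d;                         % row count differs from day to day
    dt = rand(n, 1) * 2;                  % column 1: seconds since previous row
    price_z = randn(n, 1);                % column 2: z-scored price
    target = double(rand(n, 1) > 0.5);    % column 3: 0/1 label
    processed_data{d} = [dt, price_z, target];
end

% Elapsed time within a day is the running sum of column 1.
elapsed = cumsum(processed_data{1}(:, 1));
rows_per_day = cellfun(@(day) size(day, 1), processed_data);
```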

I'm just a beginner in ML, and I'm having a really hard time picturing how the data for the LSTM layer should be formatted. If I'm correct, it needs 3-dimensional data, where one dimension represents the channel, another the time step, and another the batch. I'm now sure that I completely misunderstood these concepts when I wrote the code below:

%% Partitioning data.
train_data_length = round(length(processed_data) * 0.9);
train_data = processed_data(1: train_data_length);
test_data = processed_data(train_data_length+1:end);

%% Training setup
% Convert data to cell arrays of dlarray.
train_X = cell(size(train_data));
train_Y = cell(size(train_data));

for day = 1:length(train_data)
    % Add batch dimension (C×B×T where B=1).
    data = permute(train_data{day}(:, 1:2)', [1 3 2]); % [2×1×T]
    train_X{day} = dlarray(data, "CBT");
    % Convert labels to one-hot encoded CBT format [2×1×T].
    labels = train_data{day}(:, 3)'; %  [1×T]
    one_hot_labels = onehotencode(labels, 1, 'ClassNames', [0 1]); % [2×T]
    one_hot_labels = reshape(one_hot_labels, 2, 1, []); % [2×1×T]
    train_Y{day} = dlarray(single(one_hot_labels), "CBT");
end

ds = combine(...
    arrayDatastore(train_X, 'OutputType', 'same'), ...
    arrayDatastore(train_Y, 'OutputType', 'same'));

%clearvars -except ds test_data ml_method

num_features = 2;
num_hidden_units = 128;
num_classes = 2;
mini_batch_size = 32;

layers = [
    sequenceInputLayer(num_features, 'Name', 'input')
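    lstmLayer(num_hidden_units, 'OutputMode', 'sequence')
    fullyConnectedLayer(num_classes)  % maps hidden units to class scores
    softmaxLayer                      % class probabilities for cross-entropy
];
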
net = dlnetwork(layers);

options = trainingOptions('adam', ...
    'MaxEpochs', 30, ...
    'MiniBatchSize', mini_batch_size, ...
    'SequenceLength', 'longest', ...
    'Shuffle', 'every-epoch', ...
    'Plots', 'training-progress', ...
    'InputDataFormats', 'CBT', ...
    'Verbose', false, ...
    'ExecutionEnvironment', 'gpu');

net = trainnet(ds, net, 'crossentropy', options);
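As a quick check of the shapes the formatting loop produces, the sketch below formats a small synthetic day the same way and inspects the result. T = 5 and the rounded values are arbitrary, and it assumes Deep Learning Toolbox is available for dlarray and onehotencode.

```matlab
% Synthetic day with T = 5 rows: [seconds since previous row, z-scored price, label].
day = [2.575 0.5025 0; 1.036 0.4694 1; 1.059 0.3865 1; 0.838 0.4362 0; 1.130 0.4694 0];

data = permute(day(:, 1:2)', [1 3 2]);                    % [2 x 1 x T]
X = dlarray(data, "CBT");

labels = day(:, 3)';                                      % [1 x T]
one_hot = onehotencode(labels, 1, 'ClassNames', [0 1]);   % [2 x T]
Y = dlarray(single(reshape(one_hot, 2, 1, [])), "CBT");   % [2 x 1 x T]

disp(size(X));   % sizes of the channel, batch, and time dimensions
disp(dims(X));   % dimension labels
```

In this layout the time dimension indexes rows (observations), and the elapsed seconds are simply one of the two channels.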

In the code above, I defined the channel dimension as the number of predictors (2 in my case, and most likely the only dimension I got right). I set the batch dimension to 1 because I thought it meant the network would use one observation to make each prediction. I treated the time step as the first column of a day's data (the seconds elapsed since the previous observation) because I thought it literally meant steps in time. I now know that I was completely wrong. I also had to lower mini_batch_size from 128 to 32, which I find too low, because otherwise I run out of memory. I guess this is caused by my incorrectly formatted data (I'm not sure whether it matters, but my GPU is an RTX 2070 Super with 8 GB of memory). My question is: how should I format my data for the LSTM layer, given my goals? Or are my goals unrealistic, and am I using the data wrong?
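A rough back-of-the-envelope estimate suggests why memory may be tight with whole days as sequences. The sketch below only counts the LSTM output activations for one mini-batch, assuming every sequence is padded to about 12,000 time steps (the approximate mean day length) and stored in single precision. Training also keeps gate activations and gradients, so actual usage should be noticeably higher than these figures.

```matlab
% Rough size of the LSTM output activations for one padded mini-batch.
% The 12,000 time steps are the approximate mean day length from the post.
num_hidden_units = 128;
time_steps = 12000;
bytes_per_value = 4;                          % single precision

for mini_batch_size = [32 128]
    num_bytes = num_hidden_units * mini_batch_size * time_steps * bytes_per_value;
    fprintf('batch %3d: %.0f MB\n', mini_batch_size, num_bytes / 1e6);
end
```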

I imagined this network to be able to make a prediction for every observation in the data.
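One possible way to keep a prediction for every observation while working with much shorter sequences is to cut each day into fixed-length windows and keep one label per time step. The sketch below does this for a synthetic day. The value window_length = 500 is an arbitrary assumption rather than a recommendation, the incomplete tail of the day is simply dropped, and whether the resulting arrays can be passed to a training function unchanged depends on the data layout that function expects.

```matlab
% window_length is a hypothetical value chosen for illustration only.
window_length = 500;
rng(0);
day = [rand(1203, 1), randn(1203, 1), double(rand(1203, 1) > 0.5)];  % synthetic day

num_windows = floor(size(day, 1) / window_length);   % drop the incomplete tail
X = cell(num_windows, 1);
Y = cell(num_windows, 1);
for w = 1:num_windows
    rows = (w - 1) * window_length + (1:window_length);
    X{w} = day(rows, 1:2)';                    % [2 x window_length], channels x time
    Y{w} = categorical(day(rows, 3)', [0 1]);  % [1 x window_length], one label per step
end
```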

Upvotes: 1

Views: 17

Answers (0)
