wildcat89

Reputation: 1285

How to Train LSTM Using Multiple Datasets?

Try as I might, I have yet to find an answer for this question.

I simply want to train an LSTM network using Python 3.6 and TensorFlow on multiple .csv files/datasets, for example historical stock data for multiple companies.

The reason for this is I want to fit the model with a wide variety of price ranges, and not train individual models on every dataset. How would I go about doing this?

I can’t just append one dataset to another to create one big dataset, because during the train/test split, prices may jump from $2 to $200 depending on the stock data and where the datasets are stitched together.

What is the best practice for doing something like this?

  1. Just create a loop for every .csv file and call the .fit function to train on each file one after another (updating its weights as it goes) for a certain number of epochs, and use early stopping once the optimal loss is found? (Which I understand how to do now.)

  2. Is there a way to create a generator that could somehow yield a different x_train and y_train tuple from each .csv, fit the model with each tuple, and then have a training checkpoint after one tuple has been sampled from each .csv file? My thinking here is that the model should have a chance to sample a piece from each dataset before completing an epoch.

Example: let’s say I want to use a 20 period lookback/window size to predict t+1 ahead, and I have 5 .csv files to train with. The generator would (ideally) load all the datasets into memory, pluck a random sample of 20 rows from the first .csv file, fit it to the model, pluck another 20 rows from the second .csv, fit that, and so on. Once all 5 have been sampled, it would checkpoint to assess the loss, then move on to the next epoch and do it all over again (a rough sketch of such a generator is included below).

This might be overkill, but I wanted to be thorough. And if option 1 would accomplish the same thing, that’s fine with me too; I just haven’t come across an answer yet.
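
To illustrate what I mean by option 2, here is a rough, untested sketch of such a generator; the file paths, column names, and 'Close' target are just placeholders, and each file is assumed to have more than 20 rows:

import numpy as np
import pandas as pd

def multi_csv_generator(csv_paths, window=20, target_col='Close'):
    frames = [pd.read_csv(p) for p in csv_paths] # Load all datasets into memory once
    while True:
        xs, ys = [], []
        for df in frames:
            feature_cols = [c for c in df.columns if c not in ('Date', target_col)]
            # Pluck one random window of `window` rows from this file only,
            # so the features never straddle two different stocks
            start = np.random.randint(0, len(df) - window)
            chunk = df.iloc[start:start + window]
            xs.append(chunk[feature_cols].to_numpy())
            ys.append(df[target_col].iloc[start + window]) # The t+1 label
        # One "batch" = one window from each .csv file
        yield np.array(xs, dtype='float32'), np.array(ys, dtype='float32')

# e.g. model.fit(multi_csv_generator(csv_paths), steps_per_epoch=50, epochs=20)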

Thanks!

UPDATE

Since posting this question, one of the ways I crafted my solution (for my particular application) was the code below. Basically, if I pulled the last 5 years of stock price data for several different stocks, I would append one dataset on top of the other into one big dataset, then iterate through all the rows after assigning a "look back" period, i.e. how many days the LSTM should look back in its features. The function then looks at the date column: as long as the dates within each group of 10 feature rows are in ascending order, it bunches those features together for the LSTM. But if the date jumps from, say, 2020-09-01 back to 2015-09-01, that means that part of the dataset is where a new stock's data starts, so it just continues on down through the file until it finds 10 rows pertaining to a single stock. This makes sure that each 3D group of features for the LSTM comes from only one particular stock.

Hope that makes some kind of sense. I commented the function pretty thoroughly, so it should be easy to see how it works, and then I defined a GRU model to show how it would be put into practice from there:

# A function to get a set of X's and y's for training an LSTM, built only from
# windows whose dates are in ascending order, so you're never stitching together
# X features from two different datasets

import numpy as np
import pandas as pd

def create_batched_dataset(x, y, time_steps=1): # Default of 1, but called with 10 below

    x = x.reset_index() # Reset the index so we can compare the dates row by row
    x['Date'] = pd.to_datetime(x['Audit_Date']) # Make the source date column a datetime object

    xs, ys = [], [] # Lists for our LSTM features/labels

    for i in range(len(x) - time_steps): # Range 0 to 430 in my dataset

        v = x.iloc[i:(i + time_steps), :] # The next 10 rows of the X set

        # Only batch from one training dataset, not where two are stitched together:
        # if the window's last date isn't after its first date, the 10 rows straddle
        # the boundary between two stocks, so skip this window
        if v['Date'].iloc[-1] <= v['Date'].iloc[0]:
            continue

        v = v.set_index(['Date']) # Set the date index again

        xs.append(v.iloc[:, :-1].to_numpy()) # Append those 10 rows to xs, without the target label column

        ys.append(y.iloc[i + time_steps]) # Append the corresponding t+1 label to ys

    return np.array(xs), np.array(ys)


# Get our reshaped features/labels (to [samples, time_steps, n_features])
x_train, y_train = create_batched_dataset(train_scaled, train_scaled.iloc[:,-1], 10)
x_test, y_test = create_batched_dataset(test_scaled, test_scaled.iloc[:,-1], 10)


# Define some type of recurrent model (a GRU layer here)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(GRU(11, input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(Dense(11, activation="relu"))
model.add(Dense(1))
model.compile(loss='mae', optimizer=Adam(1 / 1000)) # learning rate of 0.001
print(model.summary())
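
For completeness, a minimal sketch of how the model above might be fit with early stopping, as described in option 1; the patience, epochs, and batch size here are just assumptions:

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=100,
                    batch_size=32,
                    shuffle=True,
                    callbacks=[early_stop])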

UPDATE 2

Here is another solution using lists. Basically, for every ticker we have a dataset for, import the df and add its stock price data to individual lists, then add those lists to one master list. Then, when you're ready to train, randomly pull a list of stock prices from the master list to feed into your NN. Note that you'll have to unpack the sub-lists (e.g. date = prices[0], open = prices[1], etc.) inside your NN function. Hope that helps:

import random
import pandas as pd

prices_library = []
for ticker in list_of_tickers: # Used for multiple tickers
    print(ticker)

    df = pd.read_csv('./' + ticker + '_' + interval + 'm.csv')

    date = df['Date'].values.tolist() # Assumes the .csv has a 'Date' column
    open = df['Open'].values.tolist()
    high = df['High'].values.tolist()
    low = df['Low'].values.tolist()
    close = df['Close'].values.tolist()
    volume = df['Volume'].values.tolist()

    prices_library.append([date,
                           open,
                           high,
                           low,
                           close,
                           volume])

for i in range(len(prices_library) * iterations):
    print('Iteration: ' + str(i + 1) + ' of ' + str(len(prices_library) * iterations))
    agent.train(iterations=1,
                checkpoint=1,
                initial_money=initial_money,
                prices=prices_library[random.randint(0, len(prices_library) - 1)])

Upvotes: 8

Views: 4775

Answers (1)

Jlanday

Reputation: 112

Merge all of the CSVs into one file and give the model enough steps so that it covers all of them. If you preprocess, create the sequences in one training file that has one row per sequence, where each sequence contains the 20 or so previous periods from a single CSV. That way, when the sequences are fed randomly into the model, each one corresponds to the correct stock.
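
To illustrate, a rough sketch of that preprocessing might look like the following (the csv_paths list, column names, and 'Close' target are placeholders, not part of the original answer):

import numpy as np
import pandas as pd

def make_sequences(df, window=20, target_col='Close'):
    # Build overlapping windows from a single dataframe only
    feature_cols = [c for c in df.columns if c not in ('Date', target_col)]
    xs, ys = [], []
    for i in range(len(df) - window):
        xs.append(df[feature_cols].iloc[i:i + window].to_numpy())
        ys.append(df[target_col].iloc[i + window])
    return np.array(xs), np.array(ys)

all_x, all_y = [], []
for path in csv_paths: # Sequences are built per file, never across files
    df = pd.read_csv(path)
    x, y = make_sequences(df)
    all_x.append(x)
    all_y.append(y)

x_train = np.concatenate(all_x)
y_train = np.concatenate(all_y)

# Shuffle so each training batch mixes sequences from different stocks
idx = np.random.permutation(len(x_train))
x_train, y_train = x_train[idx], y_train[idx]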

Upvotes: 1
