Reputation: 3189
I'm trying to predict the next character for a given string of length 100. The problem is that while I'm generating the training data, all of my RAM (32 GB, on Amazon AWS - https://aws.amazon.com/marketplace/pp/B077GCH38C?qid=1527197041958&sr=0-1&ref_=srh_res_product_title) gets consumed and the process is killed.
To build the training data I iterate over a list of articles (each 500-1,000 characters long). In each article I take the first 100 characters as input and the next character as output, then I slide the window by one character and repeat until the end of the text. This approach produces a lot of training vectors: an article of 500 characters yields about 400 training samples, and that's the problem.
With 15k articles and a sliding window of 100 there will be millions of training samples, and my AWS machine (the t2.2xlarge with 32 GB RAM linked above) dies at about 79%, around 35 million training samples.
So my question is: is there a way in Keras to start training the model on, say, 25% of the data, then load the next 25%, and so on until everything is consumed?
My pseudocode for training:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

seq_length = 100          # length of the input window
dataX, dataY = [], []

with open(articles_path, 'rt', encoding="UTF-8") as file:
    for line in file:
        article = line[0:-1]          # drop the trailing newline
        article_length = len(article)
        # here is the problematic code: every position in the article
        # becomes one training sample, so memory fills up quickly
        for i in range(0, article_length - seq_length, 1):
            seq_in = article[i:i + seq_length]
            seq_out = article[i + seq_length]
            dataX.append([tokens[char] for char in seq_in])
            dataY.append(tokens[seq_out])

# X, y are dataX/dataY reshaped into NumPy arrays, as in the linked tutorial
model = Sequential()
model.add(LSTM(256, input_shape=(seq_length, 1)))
model.add(Dropout(0.2))
model.add(Dense(len(tokens), activation=activation))
model.compile(loss=loss, optimizer=optimizer)
model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)
Note: when writing my program I followed this tutorial: https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
Upvotes: 1
Views: 374
Reputation: 11225
This looks like a good time to switch to a generator: essentially, you yield one batch at a time instead of loading the entire dataset into memory:
import numpy as np

def data_gen(batch_size=32):
    """Yield one batch at a time, using the same article/tokens variables as before."""
    dataX, dataY = list(), list()
    while True:  # the generator yields forever; Keras decides when an epoch ends
        # the formerly problematic loop, now emitting samples lazily
        for i in range(0, article_length - seq_length, 1):
            seq_in = article[i:i + seq_length]
            seq_out = article[i + seq_length]
            dataX.append([tokens[char] for char in seq_in])
            dataY.append(tokens[seq_out])
            if len(dataX) == batch_size:  # a full batch is ready
                yield np.array(dataX), np.array(dataY)
                dataX, dataY = list(), list()
You can now train using fit_generator (ref), which takes batches from your generator, so you only ever hold batch_size samples in memory rather than the entire set. You might also want to use NumPy arrays instead of Python lists.
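For example (the step count below is just a placeholder; steps_per_epoch is simply how many batches you want Keras to draw per epoch):

train_gen = data_gen(batch_size=32)
model.fit_generator(train_gen,
                    steps_per_epoch=1000,  # placeholder: number of batches per epoch
                    epochs=epochs,
                    callbacks=callbacks_list)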
For a more organised version, you can implement a Sequence class, which encapsulates the data and acts like a generator.
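A minimal sketch, assuming the same article, tokens and seq_length variables as in the question (the class name is made up); each batch of windows is built only when Keras asks for it:

import numpy as np
from keras.utils import Sequence

class CharWindowSequence(Sequence):
    def __init__(self, article, tokens, seq_length=100, batch_size=32):
        self.article = article
        self.tokens = tokens
        self.seq_length = seq_length
        self.batch_size = batch_size
        self.n_samples = len(article) - seq_length

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, idx):
        # build just this batch of sliding windows on demand
        start = idx * self.batch_size
        end = min(start + self.batch_size, self.n_samples)
        X = np.array([[self.tokens[c] for c in self.article[i:i + self.seq_length]]
                      for i in range(start, end)])
        y = np.array([self.tokens[self.article[i + self.seq_length]]
                      for i in range(start, end)])
        return X, y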
Upvotes: 1
Reputation: 23536
Your approach to data generation is interesting, but you don't have to generate every 100-character sample from your texts. Replace the problematic code with something like this:
for i in range(0, article_length - seq_length, 1):
    if random.randint(1, 10) not in [5, 6]: continue  # this will skip 80% of the samples
    seq_in = article[i:i + seq_length]
    seq_out = article[i + seq_length]
    dataX.append([tokens[char] for char in seq_in])
    dataY.append(tokens[seq_out])
Place import random somewhere near the beginning of the file. Once this is in your code, only 1 out of 5 sequences will make it into your training data, effectively reducing its size.
There's a way to make the generation of randomly sampled character strings more efficient, but it would require rewriting your code, whereas this approach only adds one extra line.
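If you ever want to go that route, one possibility (a sketch, not code from your program) is to draw the start positions directly with random.sample instead of walking over every position and skipping most of them:

import random

keep_fraction = 0.2  # keep roughly 1 in 5 windows, as above
n_keep = int((article_length - seq_length) * keep_fraction)
for i in random.sample(range(article_length - seq_length), n_keep):
    seq_in = article[i:i + seq_length]
    seq_out = article[i + seq_length]
    dataX.append([tokens[char] for char in seq_in])
    dataY.append(tokens[seq_out])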
Upvotes: 0