MAGx2

Reputation: 3189

Train Keras model on partially loaded data

I'm trying to predict the next character for a given string of length 100. The problem is that while I'm generating the training data, the whole RAM (32 GB, on an Amazon AWS instance - https://aws.amazon.com/marketplace/pp/B077GCH38C?qid=1527197041958&sr=0-1&ref_=srh_res_product_title) gets eaten and the process is killed.

To build the training data I iterate over a list of articles (each 500-1,000 chars long). In each article I take the first 100 chars as input and the next char as output, then slide the window by one char and repeat until the end of the text. This approach produces a lot of training vectors: an article with 500 chars yields about 400 training samples, and that is the problem.

With 15k articles and a sliding window of 100 there are millions of training samples, and my AWS machine (a t2.2xlarge with 32 GB RAM - https://aws.amazon.com/marketplace/pp/B077GCH38C?qid=1527197041958&sr=0-1&ref_=srh_res_product_title) dies at about 79%, around 35 million training samples.

So my question is: is there a way in Keras to start training the model on, say, 25% of the data, then load the next 25%, and repeat until everything is consumed?

My pseudo code for learning:

dataX, dataY = [], []
with open(articles_path, 'rt', encoding="UTF-8") as file:
    for line in file:
        article = line[:-1]          # strip the trailing newline
        article_length = len(article)
        # here is the problematic code: one training sample per window position
        for i in range(0, article_length - seq_length, 1):
            seq_in = article[i:i + seq_length]
            seq_out = article[i + seq_length]
            dataX.append([tokens[char] for char in seq_in])
            dataY.append(tokens[seq_out])

# X and y are built from dataX and dataY (reshape + one-hot, as in the tutorial)
model = Sequential()
model.add(LSTM(256, input_shape=(seq_length, 1)))
model.add(Dropout(0.2))
model.add(Dense(len(tokens), activation=activation))
model.compile(loss=loss, optimizer=optimizer)

model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)

Note: when writing my program I followed this tutorial: https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

Upvotes: 1

Views: 374

Answers (2)

nuric

Reputation: 11225

This looks like a good time to switch to a generator: essentially you yield one batch at a time instead of loading the entire dataset into memory.

import numpy as np

def data_gen(batch_size=32):
  """Yield a single batch at a time."""
  dataX, dataY = list(), list()
  while True:  # the generator yields forever
    # same sliding-window logic as before, but emitted one batch at a time
    for i in range(0, article_length - seq_length, 1):
      seq_in = article[i:i + seq_length]
      seq_out = article[i + seq_length]
      dataX.append([tokens[char] for char in seq_in])
      dataY.append(tokens[seq_out])
      if len(dataX) == batch_size:
        yield np.array(dataX), np.array(dataY)
        dataX, dataY = list(), list()

You can now train using fit_generator (see the Keras documentation), which pulls batches from your generator. That way only batch_size samples are in memory at a time instead of the entire set. You might also want to use NumPy arrays instead of Python lists.
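A minimal usage sketch, assuming the data_gen above and an already compiled model; the batch_size and steps_per_epoch values are illustrative, not from the answer:

# Rough wiring for fit_generator (illustrative values)
batch_size = 32
steps = (article_length - seq_length) // batch_size   # batches per pass over the data

model.fit_generator(data_gen(batch_size=batch_size),
                    steps_per_epoch=steps,
                    epochs=epochs,
                    callbacks=callbacks_list)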

For a more organised version, you can implement a Sequence class which encapsulates the data and acts as a generator.
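A minimal sketch of that, using the keras.utils.Sequence API; the class name WindowSequence and its constructor arguments are illustrative, not from the answer:

from keras.utils import Sequence
import numpy as np

class WindowSequence(Sequence):
    """Serves sliding-window batches without materialising the whole dataset."""

    def __init__(self, article, tokens, seq_length, batch_size=32):
        self.article = article
        self.tokens = tokens
        self.seq_length = seq_length
        self.batch_size = batch_size
        self.n_windows = len(article) - seq_length

    def __len__(self):
        # number of batches per epoch
        return self.n_windows // self.batch_size

    def __getitem__(self, idx):
        # build exactly one batch of windows starting at idx * batch_size
        start = idx * self.batch_size
        X, y = [], []
        for i in range(start, start + self.batch_size):
            seq_in = self.article[i:i + self.seq_length]
            X.append([self.tokens[c] for c in seq_in])
            y.append(self.tokens[self.article[i + self.seq_length]])
        return np.array(X), np.array(y)

A Sequence can be passed to fit_generator just like the plain generator, and Keras can shuffle and parallelise it safely because each batch is addressed by index.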

Upvotes: 1

lenik

Reputation: 23536

Your approach to data generation is interesting, but you don't have to generate every 100-byte sample from your texts. Replace the problematic code with something like this:

    for i in range(0, article_length - seq_length, 1):
        if random.randint(1, 10) not in [5, 6]: continue   # this will skip 80% of the samples
        seq_in = article[i:i + seq_length]
        seq_out = article[i + seq_length]
        dataX.append([tokens[char] for char in seq_in])
        dataY.append(tokens[seq_out])

Place import random somewhere near the beginning of the file. Once you put this into your code, only 1 out of 5 sequences will make it into your training data, effectively reducing the size.

There's a way to make generation of the randomly sampled character strings more efficient, but it would require rewriting your code, whereas this approach just adds one extra line.
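One possible shape of that more efficient rewrite (an assumption on my part, not the answerer's code): draw the window start offsets at random instead of scanning every position and discarding 80% of them:

import random

# Hypothetical sketch: sample start positions directly rather than
# walking all of them and skipping most.
n_samples = (article_length - seq_length) // 5   # keeps roughly the same 20% as above
for i in random.sample(range(article_length - seq_length), n_samples):
    seq_in = article[i:i + seq_length]
    seq_out = article[i + seq_length]
    dataX.append([tokens[char] for char in seq_in])
    dataY.append(tokens[seq_out])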

Upvotes: 0
