Reputation: 3189
I'm trying to predict the next character for a given string of length 100. The problem is that while I'm generating the training data, all of my RAM (32 GB, on Amazon AWS - https://aws.amazon.com/marketplace/pp/B077GCH38C?qid=1527197041958&sr=0-1&ref_=srh_res_product_title) gets consumed and the process is killed.
To build the training data I iterate over a list of articles (each 500-1,000 characters long). In each article I take the first 100 characters as input and the next character as output, then I slide the window by one character and repeat until the end of the text. This approach produces a lot of training vectors: an article of 500 characters yields about 400 training samples, and that's the problem.
With 15k articles and a sliding window of 100 there will be millions of training samples, and my AWS machine (the t2.2xlarge with 32 GB RAM linked above) dies at about 79%, around 35 million training samples.
So my question is: is there a way in Keras to start training the model on, say, 25% of the data, then load the next 25%, and so on until everything is consumed?
My pseudocode for training:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

seq_length = 100          # length of the input window
dataX, dataY = [], []

with open(articles_path, 'rt', encoding="UTF-8") as file:
    for line in file:
        article = line[0:-1]          # drop the trailing newline
        article_length = len(article)
        # here is the problematic code: every position in the article
        # becomes one training sample, so memory fills up quickly
        for i in range(0, article_length - seq_length, 1):
            seq_in = article[i:i + seq_length]
            seq_out = article[i + seq_length]
            dataX.append([tokens[char] for char in seq_in])
            dataY.append(tokens[seq_out])

# X, y are dataX/dataY reshaped into NumPy arrays, as in the linked tutorial
model = Sequential()
model.add(LSTM(256, input_shape=(seq_length, 1)))
model.add(Dropout(0.2))
model.add(Dense(len(tokens), activation=activation))
model.compile(loss=loss, optimizer=optimizer)
model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)
Note: when writing my program I followed this tutorial: https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
Upvotes: 1
Views: 374
Reputation: 11225
This looks like a good time to switch to a generator: essentially, you yield one batch at a time instead of loading the entire dataset into memory:
import numpy as np

def data_gen(batch_size=32):
    """Yield one batch at a time, using the same article/tokens variables as before."""
    dataX, dataY = list(), list()
    while True:  # the generator yields forever; Keras decides when an epoch ends
        # the formerly problematic loop, now emitting samples lazily
        for i in range(0, article_length - seq_length, 1):
            seq_in = article[i:i + seq_length]
            seq_out = article[i + seq_length]
            dataX.append([tokens[char] for char in seq_in])
            dataY.append(tokens[seq_out])
            if len(dataX) == batch_size:  # a full batch is ready
                yield np.array(dataX), np.array(dataY)
                dataX, dataY = list(), list()
You can now train using fit_generator (ref), which takes batches from your generator, so you only ever hold batch_size samples in memory rather than the entire set. You might also want to use NumPy arrays instead of Python lists.
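For example (the step count below is just a placeholder; steps_per_epoch is simply how many batches you want Keras to draw per epoch):

train_gen = data_gen(batch_size=32)
model.fit_generator(train_gen,
                    steps_per_epoch=1000,  # placeholder: number of batches per epoch
                    epochs=epochs,
                    callbacks=callbacks_list)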
For a more organised version, you can implement a Sequence class, which encapsulates the data and acts like a generator.
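A minimal sketch, assuming the same article, tokens and seq_length variables as in the question (the class name is made up); each batch of windows is built only when Keras asks for it:

import numpy as np
from keras.utils import Sequence

class CharWindowSequence(Sequence):
    def __init__(self, article, tokens, seq_length=100, batch_size=32):
        self.article = article
        self.tokens = tokens
        self.seq_length = seq_length
        self.batch_size = batch_size
        self.n_samples = len(article) - seq_length

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, idx):
        # build just this batch of sliding windows on demand
        start = idx * self.batch_size
        end = min(start + self.batch_size, self.n_samples)
        X = np.array([[self.tokens[c] for c in self.article[i:i + self.seq_length]]
                      for i in range(start, end)])
        y = np.array([self.tokens[self.article[i + self.seq_length]]
                      for i in range(start, end)])
        return X, y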
Upvotes: 1
Reputation: 23536
Your approach to data generation is interesting, but you don't have to generate every 100-character sample from your texts. Replace the problematic code with something like this:
for i in range(0, article_length - seq_length, 1):
    if random.randint(1, 10) not in [5, 6]: continue  # this will skip 80% of the samples
    seq_in = article[i:i + seq_length]
    seq_out = article[i + seq_length]
    dataX.append([tokens[char] for char in seq_in])
    dataY.append(tokens[seq_out])
Place import random somewhere near the beginning of the file. Once this is in your code, only 1 out of 5 sequences will make it into your training data, effectively reducing its size.
There's a way to make the generation of randomly sampled character strings more efficient, but it would require rewriting your code, whereas this approach only adds one extra line.
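If you ever want to go that route, one possibility (a sketch, not code from your program) is to draw the start positions directly with random.sample instead of walking over every position and skipping most of them:

import random

keep_fraction = 0.2  # keep roughly 1 in 5 windows, as above
n_keep = int((article_length - seq_length) * keep_fraction)
for i in random.sample(range(article_length - seq_length), n_keep):
    seq_in = article[i:i + seq_length]
    seq_out = article[i + seq_length]
    dataX.append([tokens[char] for char in seq_in])
    dataY.append(tokens[seq_out])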
Upvotes: 0