Reputation: 671
I am trying to train my model to generate sentences no longer than 210 characters. Everything I have read covers training on 'continuous' text, like a book. However, I am trying to train my model on single sentences.
I'm pretty new to TensorFlow and ML. Right now I am able to train my model, but it generates garbage, seemingly random text. I have 10,000 sentences, so I think I have sufficient data.
Overview of my data
Structure [['SENTENCE'], ['SENTENCE2']...]
Data Prep
tokenizer = keras.preprocessing.text.Tokenizer(num_words=209, lower=False, char_level=True, filters='#$%&()*+-<=>@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(df['title'].values)
df['encoded_with_keras'] = tokenizer.texts_to_sequences(df['title'].values)
dataset = df['encoded_with_keras'].values
dataset = tf.keras.preprocessing.sequence.pad_sequences(dataset, padding='post')
dataset = dataset.flatten()
dataset = tf.data.Dataset.from_tensor_slices(dataset)
sequences = dataset.batch(seq_len+1, drop_remainder=True)
def create_seq_targets(seq):
    input_txt = seq[:-1]
    target_txt = seq[1:]
    return input_txt, target_txt
dataset = sequences.map(create_seq_targets)
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
Model
def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, batch_input_shape=[batch_size, None], input_length=209, mask_zero=True))
    model.add(LSTM(rnn_neurons, return_sequences=True, stateful=True))
    model.add(Dropout(0.2))
    model.add(Dense(258, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(optimizer='adam', loss="sparse_categorical_crossentropy")
    return model
When I give the model a sequence to start from, I get back absolute nonsense, and eventually the model predicts a 0, which is not in the char_index mapping.
Edit
Text Generation
epochs = 2
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
model = create_model(vocab_size=vocab_size,
                     embed_dim=embed_dim,
                     rnn_neurons=rnn_neurons,
                     batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
def generate_text(model, start_string):
    num_generate = 200
    input_eval = [char_2_index[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 1
    # model.reset_states()
    for i in range(num_generate):
        print(text_generated)
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        print(predicted_id)
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(index_2_char[predicted_id])
    return (start_string + ''.join(text_generated))
Upvotes: 2
Views: 1046
Reputation: 86600
A few things need to change at first sight.
Tokenizer: num_words = vocab_size
LSTM: stateful=True means that "batch 2 is a sequel of batch 1". You have individual sentences, so use stateful=False. (Unless you are training correctly with manual training loops and resetting states for each batch, which is unnecessary trouble in the training phase.)
What you need to check visually:
Input data must have format like:
[
[1,2,3,6,10,4,10, ...up to sentence length - 1...],
[5,6,3,6,7,3,11,... up to sentence length - 1...],
.... up to number of sentences ...
]
Output data must then be:
[
[2,3,6,10,4,10,15 ...], #equal to input data, shifted by 1
[6,3,6,7,3,11,13, ...],
...
]
Print a few rows of them to check if they're correctly preprocessed as intended.
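As a rough illustration of the layout above, here is a library-free sketch of post-padding and shifting by one. The helper name make_input_and_target and the toy ids are made up for this example; in the question's pipeline the padding would come from pad_sequences, and the rows would need to be arrays before being passed to model.fit.

```python
def make_input_and_target(encoded_sentences, pad_id=0):
    """Post-pad every sentence to the same length, then shift by one step."""
    max_len = max(len(s) for s in encoded_sentences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in encoded_sentences]
    input_data = [row[:-1] for row in padded]   # everything except the last step
    output_data = [row[1:] for row in padded]   # same rows, shifted left by one
    return input_data, output_data


# Toy character ids standing in for tokenizer output (hypothetical values).
encoded = [[1, 2, 3, 6, 10], [5, 6, 3]]
input_data, output_data = make_input_and_target(encoded)

# Print the rows side by side to check the shift visually.
for x, y in zip(input_data, output_data):
    print(x, "->", y)
```

Each row holds one sentence, and position i of an output row is the character that follows position i of the matching input row.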
Training will then be easy:
model.fit(input_data, output_data, epochs=....)
Yes, your model will predict zeros, because you have zeros in your data. That's not weird: you used pad_sequences.
You can interpret a zero as a "sentence end" in this case, since you used 'post' padding. When your model gives you a zero, it has decided that the sentence it's generating should end at that point. If it was well trained, it will probably continue outputting zeros for that sentence from this point on.
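To illustrate reading a zero as the sentence end, here is a minimal sketch that decodes predicted ids and stops at the first padding id instead of looking it up. The function name decode_until_padding and the three-entry mapping are assumptions for the example; index_2_char is the name used in the question.

```python
def decode_until_padding(predicted_ids, index_2_char, pad_id=0):
    """Turn ids into text, treating the first pad_id as the end of the sentence."""
    chars = []
    for idx in predicted_ids:
        if idx == pad_id:
            break  # the model signalled "sentence end"; ignore everything after
        chars.append(index_2_char[idx])
    return "".join(chars)


# Hypothetical mapping; the real one comes from the fitted tokenizer.
index_2_char = {1: "H", 2: "i", 3: "!"}
print(decode_until_padding([1, 2, 3, 0, 0, 2], index_2_char))
```

Because 0 is never looked up, the missing entry in the mapping is no longer a problem.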
Generating text is more complex: you need to rebuild the model, this time with stateful=True, and transfer the weights from the trained model to this new model.
Before anything, call model.reset_states().
You will need to manually feed a batch with shape (number_of_sentences=batch_size, 1). This will be the "first character" of each of the sentences it will generate. The output will be the "second character" of each sentence.
Get this output and feed the model with it. It will generate the "third character" of each sentence. And so on.
When all outputs are zero, all sentences are fully generated and you can stop the loop.
Call model.reset_states() again before trying to generate a new batch of sentences.
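The loop described above can be sketched without any framework. This is only an outline under stated assumptions: predict_next and reset_states are hypothetical callables standing in for one step of the stateful model (including picking one id per row) and for model.reset_states(), and the toy "model" below just counts ids down to zero so the loop has something to run on.

```python
def generate_batch(predict_next, reset_states, first_chars, max_len=210, pad_id=0):
    """Generate one sentence per batch row, one character per step."""
    reset_states()                       # start every batch from a clean state
    sentences = [[c] for c in first_chars]
    current = list(first_chars)          # conceptually shape (batch_size, 1)
    for _ in range(max_len - 1):
        current = predict_next(current)  # next character id for each row
        if all(c == pad_id for c in current):
            break                        # every sentence has ended
        for sentence, c in zip(sentences, current):
            sentence.append(c)           # rows that ended early collect zeros
    return sentences


def toy_predict_next(current):
    # Stand-in for the stateful model: counts each id down towards 0.
    return [max(c - 1, 0) for c in current]


sentences = generate_batch(toy_predict_next, lambda: None, first_chars=[3, 1])
print(sentences)
```

Rows that finish before the others keep receiving zeros until the whole batch is done, so each row should still be cut at its first zero when it is turned back into text.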
You can find examples of this kind of predicting here: https://stackoverflow.com/a/50235563/2097240
Upvotes: 1