Reputation: 43501
I am trying to build a small LSTM that can learn to write code (even if it's garbage code) by training it on existing Python code. I have concatenated a few thousand lines of code from several hundred files into one file, with each file ending in <eos> to signify "end of sequence".
As an example, my training file looks like:
setup(name='Keras',
...
],
packages=find_packages())
<eos>
import pyux
...
with open('api.json', 'w') as f:
    json.dump(sign, f)
<eos>
I am creating tokens from the words with:
with open(self.textfile, 'r') as file:
    filecontents = file.read()
filecontents = filecontents.replace("\n\n", "\n")
filecontents = filecontents.replace('\n', ' \n ')
filecontents = filecontents.replace('    ', ' \t ')  # treat a four-space indent as one tab token
text_in_words = [w for w in filecontents.split(' ') if w != '']
self._words = set(text_in_words)
STEP = 1
self._codelines = []
self._next_words = []
for i in range(0, len(text_in_words) - self.seq_length, STEP):
    self._codelines.append(text_in_words[i: i + self.seq_length])
    self._next_words.append(text_in_words[i + self.seq_length])
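One way to sanity-check the windowing above is to count how often each token ends up as a prediction target. The following is only a minimal sketch: the token list is made up for illustration, `seq_length` is set to 4 instead of 200, and the names `tokens`, `windows`, and `targets` are hypothetical stand-ins for `text_in_words`, `self._codelines`, and `self._next_words`.

```python
from collections import Counter

# Hypothetical, tiny token stream standing in for text_in_words.
tokens = ["import", "os", "\n", "print", "(", ")", "\n", "<eos>", "\n",
          "import", "sys", "\n", "<eos>", "\n"]
seq_length = 4  # assumed small value for illustration; the question uses 200
STEP = 1

# Same sliding-window construction as in the question.
windows, targets = [], []
for i in range(0, len(tokens) - seq_length, STEP):
    windows.append(tokens[i: i + seq_length])
    targets.append(tokens[i + seq_length])

# How often does each token appear as the "next word" to predict?
target_counts = Counter(targets)
print(target_counts.most_common())
```

If `<eos>` or `\n` is missing from the targets, or is very rare compared with other tokens, that would be worth looking at before changing the model size.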
My Keras model is:
model = Sequential()
model.add(Embedding(input_dim=len(self._words), output_dim=1024))
model.add(Bidirectional(
    LSTM(128), input_shape=(self.seq_length, len(self._words))))
model.add(Dropout(rate=0.5))
model.add(Dense(len(self._words)))
model.add(Activation('softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer="adam", metrics=['accuracy'])
But no matter how much I train it, the model never seems to generate <eos> or even \n. I think it might be because my LSTM size is 128 and my seq_length is 200, but that doesn't quite make sense. Is there something I'm missing?
Upvotes: 14
Views: 1185
Reputation: 2047
Sometimes, when there is no limit on the length of the generated code, or when the <EOS> and <SOS> tokens are not mapped to numerical indices, the LSTM never converges. If you could share your outputs or error messages, it would be much easier to debug.
You could create an extra class that collects words and sentences and maps each word to a numerical index.
# tokens for start of sentence (SOS) and end of sentence (EOS)
SOS_token = 0
EOS_token = 1

class Lang:
    '''
    Class for a word object, storing sentences, words and word counts.
    '''
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
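As a rough usage sketch, the class could be fed a couple of sentences and then used to turn a sentence into indices with the EOS index appended. The `encode` helper and the sample sentences are hypothetical and only here for illustration; the class is repeated in condensed form so the snippet runs on its own.

```python
SOS_token = 0
EOS_token = 1

class Lang:
    """Condensed copy of the Lang class so this snippet is self-contained."""
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # SOS and EOS already occupy indices 0 and 1

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

def encode(lang, sentence):
    """Hypothetical helper: map each word to its index and append EOS_token."""
    return [lang.word2index[w] for w in sentence.split(' ')] + [EOS_token]

lang = Lang("python")
lang.addSentence("import os")
lang.addSentence("import sys")

encoded = encode(lang, "import sys")
decoded = [lang.index2word[i] for i in encoded]
print(encoded)  # [2, 4, 1]
print(decoded)  # ['import', 'sys', 'EOS']
```

The point of the design is that EOS is an ordinary integer class like any other word, so it appears explicitly in every training target sequence.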
Then, while generating text, simply starting the sequence with an <SOS> token would do.
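A minimal sketch of such a generation loop is shown below, assuming the SOS/EOS indices defined earlier. The `generate` function, the `max_length` cap, and the two stand-in "models" are hypothetical; a real model would replace `predict_next` with a call that picks the next index from its softmax output.

```python
SOS_token = 0
EOS_token = 1
MAX_LENGTH = 20  # assumed cap so generation always terminates

def generate(predict_next, max_length=MAX_LENGTH):
    """Start from SOS, append predicted indices, stop at EOS or at the cap."""
    sequence = [SOS_token]
    for _ in range(max_length):
        token = predict_next(sequence)
        if token == EOS_token:
            break
        sequence.append(token)
    return sequence[1:]  # drop the leading SOS

def toy_model(sequence):
    """Stand-in for a trained model: emits 2, 3, 4 and then EOS."""
    script = [2, 3, 4, EOS_token]
    return script[len(sequence) - 1]

def never_stops(sequence):
    """Stand-in for a model that never predicts EOS."""
    return 2

out = generate(toy_model)
capped = generate(never_stops, max_length=5)
print(out)     # [2, 3, 4]
print(capped)  # [2, 2, 2, 2, 2]
```

The second call shows why a length limit matters: if the model never emits EOS, the cap is the only thing that ends generation.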
You can use https://github.com/sherjilozair/char-rnn-tensorflow, a character-level RNN, for reference.
Upvotes: 4