Reputation: 375
My bigram language model works fine when one word is given as input, but when I give two words to my trigram model, it behaves strangely and predicts 'unknown' as the next word. My code:
    def get_unigram_probability(word):
        if word not in unigram:
            return 0
        return unigram[word] / total_words
    def get_bigram_probability(words):
        if words not in bigram:
            return 0
        return bigram[words] / unigram[words[0]]
    V = len(vocabulary)

    def get_trigram_probability(words):
        if words not in trigram:
            return 0
        return trigram[words] + 1 / bigram[words[:2]] + V
For bigram next-word prediction:
    def find_next_word_bigram(words):
        candidate_list = []
        # Calculate the probability for each word in the vocabulary
        for word in vocabulary:
            p2 = get_bigram_probability((words[-1], word))
            candidate_list.append((word, p2))
        # Sort so the most probable words come first
        candidate_list.sort(key=lambda x: x[1], reverse=True)
        # print(candidate_list)
        return candidate_list[0]
For trigram next-word prediction:
    def find_next_word_trigram(words):
        candidate_list = []
        # Calculate the probability for each word in the vocabulary
        for word in vocabulary:
            p3 = get_trigram_probability((words[-2], words[-1], word)) if len(words) >= 3 else 0
            candidate_list.append((word, p3))
        # Sort so the most probable words come first
        candidate_list.sort(key=lambda x: x[1], reverse=True)
        # print(candidate_list)
        return candidate_list[0]
I just want to know where in the code I should make changes so that the trigram model predicts the next word when given an input of two words.
Upvotes: 2
Views: 645
Reputation: 15623
When you build your trigrams, use a special BOS (beginning-of-sentence) token so you can handle short sequences. Before each sentence, add BOS twice, like so:
    I like cheese
    BOS BOS I like cheese
This way, when you take input from the user, you can prepend BOS BOS to it and complete even short sequences.
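As a minimal sketch of what that could look like, assuming your sentences are already tokenized into lists and that unigram, bigram, and trigram are plain count dictionaries as in your code (BOS, count_ngrams, and pad_query are illustrative names, not part of your code):

    from collections import defaultdict

    BOS = "<BOS>"  # special beginning-of-sentence token

    unigram = defaultdict(int)
    bigram = defaultdict(int)
    trigram = defaultdict(int)

    def count_ngrams(sentences):
        # sentences: an iterable of token lists, e.g. [["I", "like", "cheese"], ...]
        for tokens in sentences:
            padded = [BOS, BOS] + tokens  # prepend BOS twice
            for i, word in enumerate(padded):
                unigram[word] += 1
                if i >= 1:
                    bigram[(padded[i - 1], word)] += 1
                if i >= 2:
                    trigram[(padded[i - 2], padded[i - 1], word)] += 1

    def pad_query(words):
        # Prepend BOS so even a zero- or one-word prompt has two tokens of context
        return [BOS, BOS] + list(words)

Counting bigrams over the padded sequence also gives you the (BOS, BOS) and (BOS, first-word) counts that the bigram denominator in your trigram probability needs. A query then becomes, for example, find_next_word_trigram(pad_query(["I"])), which looks up trigrams starting with (BOS, "I").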
Upvotes: 1