Reputation:
I'm reading the book 'The Practice of Programming' by Brian W. Kernighan and Rob Pike. Chapter 3 provides the algorithm for a Markov chain approach that reads a source text and uses it to generate random text that "reads well" (meaning the output is closer to proper-sounding English than gibberish):
set w1 and w2 to the first two words in the source text
print w1 and w2
loop:
randomly choose w3, one of the successors of prefix w1 and w2 in the source text
print w3
replace w1 and w2 by w2 and w3
repeat loop
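For concreteness, here is a rough Python sketch of that algorithm as I understand it (the dictionary-based table and the names like `build_table` are mine, not from the book):

```python
import random
from collections import defaultdict

def build_table(words):
    # Map each two-word prefix to the list of words that follow it in the source.
    table = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        table[(w1, w2)].append(w3)
    return table

def generate(words, n=100):
    table = build_table(words)
    w1, w2 = words[0], words[1]        # set w1 and w2 to the first two words
    out = [w1, w2]                     # print w1 and w2
    for _ in range(n):
        successors = table.get((w1, w2))
        if not successors:             # the case my question is about
            break
        w3 = random.choice(successors) # randomly choose a successor of (w1, w2)
        out.append(w3)                 # print w3
        w1, w2 = w2, w3                # replace w1 and w2 by w2 and w3
    return " ".join(out)

print(generate("the quick brown fox jumps over the lazy dog the quick brown cat".split()))
```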
My question is: what's the standard way to handle the situation where the new values for w2 and w3 have no successor in the source text?
Many thanks in advance!
Upvotes: 3
Views: 681
Reputation: 88158
The situation you are describing considers 3-grams, that is, the statistical frequency of 3-tuples in a given dataset. To create a Markov chain with no absorbing states, i.e. no point where f_2(w1, w2) yields a w3 such that f_2(w2, w3) = 0, you'll have to extend the possibilities. A generalized extension of @ThomasW's answer would be:

if f_2(w1, w2) != 0, draw from that;
else if f_1(w2) != 0, draw from that;
else if f_0() != 0, draw from that.

That is, draw as usual from the 3-gram set, then from the 2-gram set, then from the 1-gram set. At the last step you'll simply be drawing a word at random, weighted by its statistical frequency.
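A minimal Python sketch of that back-off (the f2/f1/f0 tables and their construction are mine, for illustration only):

```python
import random
from collections import defaultdict, Counter

def build_tables(words):
    # f2: successors of each two-word prefix; f1: successors of each single word;
    # f0: overall word frequencies, used for the final fallback.
    f2, f1 = defaultdict(list), defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        f2[(w1, w2)].append(w3)
    for w1, w2 in zip(words, words[1:]):
        f1[w1].append(w2)
    f0 = Counter(words)
    return f2, f1, f0

def next_word(w1, w2, f2, f1, f0):
    # Back off: 3-gram set, then 2-gram set, then a frequency-weighted random word.
    if f2[(w1, w2)]:
        return random.choice(f2[(w1, w2)])
    if f1[w2]:
        return random.choice(f1[w2])
    words, counts = zip(*f0.items())
    return random.choices(words, weights=counts)[0]
```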
Upvotes: 1
Reputation: 1839
I believe that this is a serious problem in NLP, one without a straightforward solution. One approach is to tag the parts of speech in addition to the actual words, in order to generalize the mappings. Using parts of speech, the program can at least predict what part of speech should follow w2 and w3 if there is no precedent for the word sequence. "Once this mapping has been performed on training examples, we can train a tagging model on these training examples. Given a new test sentence we can then recover the sequence of tags from the model, and it is straightforward to identify the entities identified by the model." From Columbia notes.
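A rough sketch of how such a fallback might be wired up in Python with NLTK (my own illustration, not from the notes; it assumes the NLTK tokenizer and POS-tagger resources are already downloaded):

```python
import random
from collections import defaultdict
import nltk  # assumes the NLTK tokenizer and POS-tagger data are installed

def build_tables(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    by_words, by_tags = defaultdict(list), defaultdict(list)
    for i in range(len(words) - 2):
        by_words[(words[i], words[i + 1])].append(words[i + 2])
        # Generalize: index the same successor by the prefix's part-of-speech tags too.
        by_tags[(tags[i], tags[i + 1])].append(words[i + 2])
    return by_words, by_tags

def next_word(w1, w2, by_words, by_tags):
    if by_words[(w1, w2)]:
        return random.choice(by_words[(w1, w2)])
    # No precedent for this exact word pair: fall back to its part-of-speech pattern.
    (_, t1), (_, t2) = nltk.pos_tag([w1, w2])
    candidates = by_tags.get((t1, t2))
    return random.choice(candidates) if candidates else None
```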
Upvotes: 0
Reputation: 14164
Here are your options:
I'd probably go with trying #2 or #3 once, then fall back to #1 -- which will always work.
Upvotes: 1