Reputation: 16960
Given two documents:
Now is the winter of our discontent Made glorious summer by this sun of York; And all the clouds that lour'd upon our house In the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; Our bruised arms hung up for monuments; Our stern alarums changed to merry meetings, Our dreadful marches to delightful measures.
(Richard III)
Richard (Duke of York): [bangs his goblet thrice on the table] Silence! Silence! For the king!
King (Richard III): [stands, hunched, speaks awkwardly] Now is the summer of our sweet content, [Made?] [err?]-cast winter by these Tudor clouds.
(BlackAdder, Season 1, Episode 1)
What sort of algorithm could someone use to find out which 3-word sequences exist in both documents? If there are any, of course - it's not guaranteed.
In this example, "now is the" is the only 3-word sequence that appears in both documents.
Upvotes: 2
Views: 133
Reputation: 3494
Short of brute force, the simplest solution would be to make a hash map, with entries for each tuple of words for one text, and then check whether each tuple in the other text is present in the hash map. This would be linear given a constant window (3 words), but we can do this in linear time independent of the window size.
To do this efficiently, we use a rolling hash. The input characters to the rolling hash would be hashes of the words. This technique is known as the Rabin-Karp algorithm.
Upvotes: 2
Reputation: 279265
Sounds like you're looking for common substrings of length 3, except that the "characters" your "strings" are composed of, are actually words, not single characters. So, "I have a cunning plan" is a string of length 5.
Suffix trees usually get mentioned at this point: http://en.wikipedia.org/wiki/Generalised_suffix_tree, but whether you need that depends how long your texts are. A couple of nested loops to compare starting from each pair of word boundaries (one from each string) will get the job done eventually.
Upvotes: 2