What algorithm can I use to determine if any sequence of n words appears in two documents?

Question

Given two documents:

Now is the winter of our discontent Made glorious summer by this sun of York; And all the clouds that lour'd upon our house In the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; Our bruised arms hung up for monuments; Our stern alarums changed to merry meetings, Our dreadful marches to delightful measures.

(Richard III)

Richard (Duke of York): [bangs his goblet thrice on the table] Silence! Silence! For the king!

King (Richard III): [stands, hunched, speaks awkwardly] Now is the summer of our sweet content, [Made?] [err?]-cast winter by these Tudor clouds.

(BlackAdder, Season 1, Episode 1)

What sort of algorithm could someone use to find out which 3-word sequences exist in both documents? If there are any, of course - it's not guaranteed.

In this example, "now is the" is the only 3-word sequence that appears in both documents.

Nabb · Accepted Answer

Short of brute force, the simplest solution would be to make a hash map, with entries for each tuple of words for one text, and then check whether each tuple in the other text is present in the hash map. This would be linear given a constant window (3 words), but we can do this in linear time independent of the window size.

To do this efficiently, we use a rolling hash. The input characters to the rolling hash would be hashes of the words. This technique is known as the Rabin-Karp algorithm.

What algorithm can I use to determine if any sequence of n words appears in two documents?

Answers (2)

Related Questions