Find matching phrases in a group of words

Question

I created a program to parse some texts files and count the number of words then sort them descending. This works great but I would like to take it up another level.

I would like to be able to find out any word phrases in the text that repeats and I am not sure how to go about doing it.

My current algorithm is to first split the text up into words then create a hash table with the word and count like this value:key

hash:
    "word":3,
    "test":12,
     .....

then I just sort the has based on key and output and I am done.

Let's say that I have this happy birthday song:

Happy Birthday to You
Happy Birthday to You
Happy Birthday Dear (name)
Happy Birthday to You.

From good friends and true,
From old friends and new,
May good luck go with you,
And happiness too.

Alternative ending:
How old are you?
How old are you?
How old, How old
How old are you?

I can get the word counts just fine but what if I wanted to match all the phrases?

for instance this 6 word phrase could be said to be matched twice:

happy birthday to you happy birthday

a pair 5 word phrase match:

birthday to you happy birthday
happy birthday to you happy

some 4 word phrase matches

how old are you
happy birthday to you
to you happy birthday
how old how old
birthday to you happy

and so on down to two word phrases that match.

I am more concerned with the matching the entire phrase even across lines because I will have to look over the output for further processing anyways.

What type of algorithm would allow me to achieve this goal?

Tomboyo · Accepted Answer

First, you might want to tokenize the passage with a quick regex to make iterating over the words a bit easier, such as by using your language's String.split method on all whitespace/newline characters. That should leave you with a String array like so: ["Happy", "birthday", "to", "you", "happy", ...]. You won't need to downcase the strings if you use regular expressions later on, which I suggest in this answer.

Following that, you need to extract phrases from the passage, which you can achieve by creating a start and end pointer and iterating like so:

for (var start = 0; start < tokens.length; start+=1) {
    for (var end = start; end < tokens.length; end+=1) {
        var phrase = tokens.slice(start, end)
        // Count occurrences of phrase ...
    }
}

The above would use every word as a start point for extraction and every following word as an end point for extraction, which allows single words and entire phrases to be picked up in phrase. Note that there are (if my math is correct) (n + n^2) / 2 of these phrases, so this thing has exponential growth. If you are actively storing all of the phrases until the end, the memory usage could get pretty hefty for large data.

The regular expression match itself can find the number of occurrences of a given phrase, so you're not restricted to using a hashtable to store the results of your work. You can save on memory by only storing those phrases with more than one occurrence in the passage.

Find matching phrases in a group of words

Answers (2)

Related Questions