user1610950
user1610950

Reputation: 1917

Find matching phrases in a group of words

I created a program to parse some texts files and count the number of words then sort them descending. This works great but I would like to take it up another level.

I would like to be able to find out any word phrases in the text that repeats and I am not sure how to go about doing it.

My current algorithm is to first split the text up into words then create a hash table with the word and count like this value:key

hash:
    "word":3,
    "test":12,
     .....

then I just sort the has based on key and output and I am done.

Let's say that I have this happy birthday song:

Happy Birthday to You
Happy Birthday to You
Happy Birthday Dear (name)
Happy Birthday to You.

From good friends and true,
From old friends and new,
May good luck go with you,
And happiness too.

Alternative ending:
How old are you?
How old are you?
How old, How old
How old are you?

I can get the word counts just fine but what if I wanted to match all the phrases?

for instance this 6 word phrase could be said to be matched twice:

happy birthday to you happy birthday

a pair 5 word phrase match:

birthday to you happy birthday
happy birthday to you happy

some 4 word phrase matches

how old are you
happy birthday to you
to you happy birthday
how old how old
birthday to you happy

and so on down to two word phrases that match.

I am more concerned with the matching the entire phrase even across lines because I will have to look over the output for further processing anyways.

What type of algorithm would allow me to achieve this goal?

Upvotes: 1

Views: 400

Answers (2)

Tomboyo
Tomboyo

Reputation: 181

First, you might want to tokenize the passage with a quick regex to make iterating over the words a bit easier, such as by using your language's String.split method on all whitespace/newline characters. That should leave you with a String array like so: ["Happy", "birthday", "to", "you", "happy", ...]. You won't need to downcase the strings if you use regular expressions later on, which I suggest in this answer.

Following that, you need to extract phrases from the passage, which you can achieve by creating a start and end pointer and iterating like so:

for (var start = 0; start < tokens.length; start+=1) {
    for (var end = start; end < tokens.length; end+=1) {
        var phrase = tokens.slice(start, end)
        // Count occurrences of phrase ...
    }
}

The above would use every word as a start point for extraction and every following word as an end point for extraction, which allows single words and entire phrases to be picked up in phrase. Note that there are (if my math is correct) (n + n^2) / 2 of these phrases, so this thing has exponential growth. If you are actively storing all of the phrases until the end, the memory usage could get pretty hefty for large data.

The regular expression match itself can find the number of occurrences of a given phrase, so you're not restricted to using a hashtable to store the results of your work. You can save on memory by only storing those phrases with more than one occurrence in the passage.

Upvotes: 1

RageMcMad
RageMcMad

Reputation: 1

You could use the same algorithm with word-combinations. If u use a queue of maximum size n, you can concat the last n words checked (e.g. via iterator) and add them to your hashtable. Repeat this for n=2 until n > ( your #words / 2 ) or no repetition was found

Example “W1 w2 w3, W3 w1 w2.“

Should give a hashtable with .. Hash2: “w1 w2“:2 “w2 w3“:1 “w3 w3“:1 “w3 w1“:1 ..for n=2 (ignoring capital letters and comma) For n=3 your highest count would be 1 and you could break

Cleaning newlines from your wordlist and using an aditional whitespace when concat-ing might be necessary

Upvotes: 0

Related Questions