Reputation: 1917
I created a program to parse some texts files and count the number of words then sort them descending. This works great but I would like to take it up another level.
I would like to be able to find out any word phrases in the text that repeats and I am not sure how to go about doing it.
My current algorithm is to first split the text up into words then create a hash table with the word and count like this value:key
hash:
"word":3,
"test":12,
.....
then I just sort the has based on key and output and I am done.
Let's say that I have this happy birthday song:
Happy Birthday to You
Happy Birthday to You
Happy Birthday Dear (name)
Happy Birthday to You.
From good friends and true,
From old friends and new,
May good luck go with you,
And happiness too.
Alternative ending:
How old are you?
How old are you?
How old, How old
How old are you?
I can get the word counts just fine but what if I wanted to match all the phrases?
for instance this 6 word phrase could be said to be matched twice:
happy birthday to you happy birthday
a pair 5 word phrase match:
birthday to you happy birthday
happy birthday to you happy
some 4 word phrase matches
how old are you
happy birthday to you
to you happy birthday
how old how old
birthday to you happy
and so on down to two word phrases that match.
I am more concerned with the matching the entire phrase even across lines because I will have to look over the output for further processing anyways.
What type of algorithm would allow me to achieve this goal?
Upvotes: 1
Views: 400
Reputation: 181
First, you might want to tokenize the passage with a quick regex to make iterating over the words a bit easier, such as by using your language's String.split method on all whitespace/newline characters. That should leave you with a String array like so: ["Happy", "birthday", "to", "you", "happy", ...]
. You won't need to downcase the strings if you use regular expressions later on, which I suggest in this answer.
Following that, you need to extract phrases from the passage, which you can achieve by creating a start
and end
pointer and iterating like so:
for (var start = 0; start < tokens.length; start+=1) {
for (var end = start; end < tokens.length; end+=1) {
var phrase = tokens.slice(start, end)
// Count occurrences of phrase ...
}
}
The above would use every word as a start point for extraction and every following word as an end point for extraction, which allows single words and entire phrases to be picked up in phrase
. Note that there are (if my math is correct) (n + n^2) / 2 of these phrases, so this thing has exponential growth. If you are actively storing all of the phrases until the end, the memory usage could get pretty hefty for large data.
The regular expression match itself can find the number of occurrences of a given phrase, so you're not restricted to using a hashtable to store the results of your work. You can save on memory by only storing those phrases with more than one occurrence in the passage.
Upvotes: 1
Reputation: 1
You could use the same algorithm with word-combinations. If u use a queue of maximum size n, you can concat the last n words checked (e.g. via iterator) and add them to your hashtable. Repeat this for n=2 until n > ( your #words / 2 ) or no repetition was found
Example “W1 w2 w3, W3 w1 w2.“
Should give a hashtable with .. Hash2: “w1 w2“:2 “w2 w3“:1 “w3 w3“:1 “w3 w1“:1 ..for n=2 (ignoring capital letters and comma) For n=3 your highest count would be 1 and you could break
Cleaning newlines from your wordlist and using an aditional whitespace when concat-ing might be necessary
Upvotes: 0