Reputation: 267287
I want an algorithm which would create all possible phrases in a block of text. For example, in the text:
"My username is click upvote. I have 4k rep on stackoverflow"
It would create the following combinations:
"My username"
"My Username is"
"username is click"
"is click"
"is click upvote"
"click upvote"
"i have"
"i have 4k"
"have 4k"
..
You get the idea. Basically the point is to get all possible combinations of 'phrases' out of a sentence. Any thoughts for how to best implement this?
Upvotes: 1
Views: 2756
Reputation: 5052
You might already know that the technical term for such phrases is Shingle. You can get shingles for input text with Lucene's ShingeMatrixFilter.
Upvotes: 1
Reputation: 27097
Well, I don't know PHP or java, but basically you want a double loop over all the words in your text. Here's some pseudo-code:
words = split(text)
n = len(words)
for i in 1...n-1 { // i = first word in phrase
for j in i+1...n { // j = last word in phrase
phrase = join(words[i:j])
print phrase
}
}
Note that the second loop starts from i, not 1. This gives you all phrases that start from word number i to word number j, which is greater than i (so all phrases have at least two words).
Ah, I just realized that you probably don't want phrases to cross sentence boundaries. So you'll want an outer loop which splits the text into sentences first, but then runs this on each sentence.
This seems pretty clear if you have any programming experience at all, but just in case: The for
statements are loops [like for(i=1; i<=n; i++)
], split
is some function which takes a string and splits it into an array of words -- this is not completely trivial, but there's probably a library function to do this, len
gives the length of the array, join
puts them back together with spaces in between, and the syntax [i:j]
means all of the elements from i
to j
inclusive (in python, this would actually be [i:j+1]
). Oh, and I've implicitly assumed arrays start at index 1 rather than zero; I leave changing to 0-based C
arrays as an exercise...
Finally, to answer the specific questions:
Note that the "second" loop is actually an inner loop; for each value of i
(the first word of the phrase) we loop from i+1
to the end of the sentence to give the last word of the phrase.
Now that we have the number of the first and last words, the join
function -- which you'll have to write -- concatenates the individual strings word[i], word[i+1], ... word[j]
with spaces in between to form the phrase. In practice, this may mean the function could be declared like join(words, i, j)
and returns the string, although some languages have ways to make this easier.
Upvotes: 5
Reputation: 882626
Basically you need to first separate the block of text into sentences. That's tricky enough, even in English since you need to look out for periods, question marks, exclamation marks and any other sentence terminators.
Then you process one sentence at a time after removing all punctuation (commas, semi-colons, colons, and so on).
Then, when you're left with an array of words, it becomes simpler:
for i = 1 to num_words-1:
for j = i+1 to num_words:
phrase = words[i through j inclusive]
store phrase
That's it, pretty simple (after initial massaging of the text block, which may not be as simple as you think).
This will give you all phrases of two or more words in every sentence.
The separation into sentences, separation into words, removal of punctuation and so on will be the hardest bit but I've already shown you some simple initial rules to follow. The rest should be added every time a block of text breaks the algorithm.
Update:
As requested, here's some Java code which gives the phrases:
public class testme {
public final static String text =
"My username is click upvote." +
" I have 4k rep on stackoverflow.";
public static void procSentence (String sent) {
System.out.println ("==========");
System.out.println ("sentence [" + sent + "]");
// Split sentence at whitspace into array.
String [] sa = sent.split("\\s+");
// Process each starting word.
for (int i = 0; i < sa.length - 1; i++) {
// Process each phrase.
for (int j = i+1; j < sa.length; j++) {
// Build the phrase.
String phrase = sa[i];
for (int k = i+1; k <= j; k++) {
phrase = phrase + " " + sa[k];
}
// This is where you have your phrase. I just
// print it out but you can do whatever you
// wish with it.
System.out.println (" " + phrase);
}
}
}
public static void main(String[] args) {
// This is the block of text to process.
String block = text;
System.out.println ("block [" + block + "]");
// Keep going until no more sentences.
while (!block.equals("")) {
// Remove leading spaces.
if (block.startsWith(" ")) {
block = block.substring(1);
continue;
}
// Find end of sentence.
int pos = block.indexOf('.');
// Extract sentence and remove it from text block.
String sentence = block.substring(0,pos);
block = block.substring(pos+1);
// Process the sentence (this is the "meat").
procSentence (sentence);
System.out.println ("block [" + block + "]");
}
System.out.println ("==========");
}
}
which outputs:
block [My username is click upvote. I have 4k rep on stackoverflow.]
==========
sentence [My username is click upvote]
My username
My username is
My username is click
My username is click upvote
username is
username is click
username is click upvote
is click
is click upvote
click upvote
block [ I have 4k rep on stackoverflow.]
==========
sentence [I have 4k rep on stackoverflow]
I have
I have 4k
I have 4k rep
I have 4k rep on
I have 4k rep on stackoverflow
have 4k
have 4k rep
have 4k rep on
have 4k rep on stackoverflow
4k rep
4k rep on
4k rep on stackoverflow
rep on
rep on stackoverflow
on stackoverflow
block []
==========
Now, keep in mind this is pretty basic Java (some might say it's C written in a Java dialect :-). It's just meant to illustrate how to output word groupings from a sentence as you asked for.
It does not do all the fancy sentence detection and punctuation removal I mentioned in the original answer.
Upvotes: 5
Reputation: 17904
Just tokenize the sentence and use the CombinationGenerator. The algorithm is described by Kenneth H. Rosen, Discrete Mathematics and Its Applications, 2nd edition (NY: McGraw-Hill, 1991), pp. 284-286.
Here's the code and example of use: http://www.merriampark.com/comb.htm
Upvotes: 2