Ali
Ali

Reputation: 267287

Algorithm for analyzing text of words

I want an algorithm which would create all possible phrases in a block of text. For example, in the text:

"My username is click upvote. I have 4k rep on stackoverflow"

It would create the following combinations:

"My username"
"My Username is"
"username is click"
"is click"
"is click upvote"
"click upvote"
"i have"
"i have 4k"
"have 4k"
..

You get the idea. Basically the point is to get all possible combinations of 'phrases' out of a sentence. Any thoughts for how to best implement this?

Upvotes: 1

Views: 2756

Answers (5)

MEK
MEK

Reputation:

Could play with str_word_count(); and build it as you like.

Upvotes: 1

Shashikant Kore
Shashikant Kore

Reputation: 5052

You might already know that the technical term for such phrases is Shingle. You can get shingles for input text with Lucene's ShingeMatrixFilter.

Upvotes: 1

Andrew Jaffe
Andrew Jaffe

Reputation: 27097

Well, I don't know PHP or java, but basically you want a double loop over all the words in your text. Here's some pseudo-code:

words = split(text)
n = len(words)
for i in 1...n-1 {        // i = first word in phrase 
    for j in i+1...n {       // j = last word in phrase
        phrase = join(words[i:j])
        print phrase
    }
}

Note that the second loop starts from i, not 1. This gives you all phrases that start from word number i to word number j, which is greater than i (so all phrases have at least two words).

Ah, I just realized that you probably don't want phrases to cross sentence boundaries. So you'll want an outer loop which splits the text into sentences first, but then runs this on each sentence.

This seems pretty clear if you have any programming experience at all, but just in case: The for statements are loops [like for(i=1; i<=n; i++)], split is some function which takes a string and splits it into an array of words -- this is not completely trivial, but there's probably a library function to do this, len gives the length of the array, join puts them back together with spaces in between, and the syntax [i:j] means all of the elements from i to j inclusive (in python, this would actually be [i:j+1]). Oh, and I've implicitly assumed arrays start at index 1 rather than zero; I leave changing to 0-based C arrays as an exercise...

Finally, to answer the specific questions:

  • Note that the "second" loop is actually an inner loop; for each value of i (the first word of the phrase) we loop from i+1 to the end of the sentence to give the last word of the phrase.

  • Now that we have the number of the first and last words, the join function -- which you'll have to write -- concatenates the individual strings word[i], word[i+1], ... word[j] with spaces in between to form the phrase. In practice, this may mean the function could be declared like join(words, i, j) and returns the string, although some languages have ways to make this easier.

Upvotes: 5

paxdiablo
paxdiablo

Reputation: 882626

Basically you need to first separate the block of text into sentences. That's tricky enough, even in English since you need to look out for periods, question marks, exclamation marks and any other sentence terminators.

Then you process one sentence at a time after removing all punctuation (commas, semi-colons, colons, and so on).

Then, when you're left with an array of words, it becomes simpler:

for i = 1 to num_words-1:
    for j = i+1 to num_words:
        phrase = words[i through j inclusive]
        store phrase

That's it, pretty simple (after initial massaging of the text block, which may not be as simple as you think).

This will give you all phrases of two or more words in every sentence.

The separation into sentences, separation into words, removal of punctuation and so on will be the hardest bit but I've already shown you some simple initial rules to follow. The rest should be added every time a block of text breaks the algorithm.

Update:

As requested, here's some Java code which gives the phrases:

public class testme {
    public final static String text =
        "My username is click upvote." +
        " I have 4k rep on stackoverflow.";

 

    public static void procSentence (String sent) {
        System.out.println ("==========");
        System.out.println ("sentence [" + sent + "]");

        // Split sentence at whitspace into array.

        String [] sa = sent.split("\\s+");

        // Process each starting word.

        for (int i = 0; i < sa.length - 1; i++) {

            // Process each phrase.

            for (int j = i+1; j < sa.length; j++) {

                // Build the phrase.

                String phrase = sa[i];
                for (int k = i+1; k <= j; k++) {
                    phrase = phrase + " " + sa[k];
                }

                // This is where you have your phrase. I just
                // print it out but you can do whatever you
                // wish with it.
                System.out.println ("   " + phrase);
            }
        }
    }

 

    public static void main(String[] args) {
        // This is the block of text to process.

        String block = text;
        System.out.println ("block    [" + block + "]");

        // Keep going until no more sentences.

        while (!block.equals("")) {
            // Remove leading spaces.

            if (block.startsWith(" ")) {
                block = block.substring(1);
                continue;
            }

            // Find end of sentence.

            int pos = block.indexOf('.');

            // Extract sentence and remove it from text block.

            String sentence = block.substring(0,pos);
            block = block.substring(pos+1);

            // Process the sentence (this is the "meat").

            procSentence (sentence);

            System.out.println ("block    [" + block + "]");
        }
        System.out.println ("==========");
    }
}

which outputs:

block    [My username is click upvote. I have 4k rep on stackoverflow.]
==========
sentence [My username is click upvote]
   My username
   My username is
   My username is click
   My username is click upvote
   username is
   username is click
   username is click upvote
   is click
   is click upvote
   click upvote
block    [ I have 4k rep on stackoverflow.]
==========
sentence [I have 4k rep on stackoverflow]
   I have
   I have 4k
   I have 4k rep
   I have 4k rep on
   I have 4k rep on stackoverflow
   have 4k
   have 4k rep
   have 4k rep on
   have 4k rep on stackoverflow
   4k rep
   4k rep on
   4k rep on stackoverflow
   rep on
   rep on stackoverflow
   on stackoverflow
block    []
==========

Now, keep in mind this is pretty basic Java (some might say it's C written in a Java dialect :-). It's just meant to illustrate how to output word groupings from a sentence as you asked for.

It does not do all the fancy sentence detection and punctuation removal I mentioned in the original answer.

Upvotes: 5

Cuga
Cuga

Reputation: 17904

Just tokenize the sentence and use the CombinationGenerator. The algorithm is described by Kenneth H. Rosen, Discrete Mathematics and Its Applications, 2nd edition (NY: McGraw-Hill, 1991), pp. 284-286.

Here's the code and example of use: http://www.merriampark.com/comb.htm

Upvotes: 2

Related Questions