Arun Abraham

Reputation: 4037

How to write efficient code for extracting noun phrases?

I am trying to extract phrases from POS-tagged text using rules such as the following (-> means "followed by"):

1) NNP -> NNP
2) NNP -> CC -> NNP
3) VP -> NP

and so on.

This is the code I have written so far. Can someone tell me how I can do this in a better way?

    List<String> nounPhrases = new ArrayList<String>();
    for (List<HasWord> sentence : documentPreprocessor) {
        System.out.println(Sentence.listToString(sentence, false));
        List<TaggedWord> tSentence = tagger.tagSentence(sentence);

        // Remember the previous token so adjacent tag pairs can be checked.
        String lastTag = null, lastWord = null;
        for (TaggedWord taggedWord : tSentence) {
            // Rule: NNP followed by NNP.
            if ("NNP".equalsIgnoreCase(lastTag) && "NNP".equalsIgnoreCase(taggedWord.tag())) {
                // Emit the words in sentence order: previous word first.
                nounPhrases.add(lastWord + " " + taggedWord.word());
            }
            lastTag = taggedWord.tag();
            lastWord = taggedWord.word();
        }
    }

In the code above I have only handled the "NNP followed by NNP" rule. How can I generalise it so that I can add other rules too? I know there are libraries that do this, but I want to do it manually.

Upvotes: 0

Views: 533

Answers (2)

Darshan Pandit

Reputation: 178

Most existing library implementations build a finite-state machine to achieve this. They are reliable, efficient and open. However, a very naive approach is to write regular expressions over the array of POS tags and then use the match offsets to mark the phrases. It sounds logical and simple, though it can produce incorrect results.
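
As a rough illustration of the regular-expression idea, here is a minimal sketch that uses only the JDK. It assumes the words and tags have already been pulled out of the tagger output into two parallel arrays. The class name `TagPatternChunker` and the helpers `rule` and `extract` are made up for this example and are not part of any library. Each tag is written into a string followed by a single space, and a rule is a regular expression over that string in which every tag is also followed by a space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagPatternChunker {

    // Wraps a rule so that a match can only begin at the start of a tag
    // (at the beginning of the string or right after a space).
    static Pattern rule(String body) {
        return Pattern.compile("(?<![^ ])(?:" + body + ")");
    }

    // Returns every run of words whose tag sequence matches the rule.
    static List<String> extract(String[] words, String[] tags, Pattern rule) {
        // Build e.g. "NNP NNP CC NNP " and remember where each tag starts.
        StringBuilder tagString = new StringBuilder();
        int[] startOffset = new int[tags.length];
        for (int i = 0; i < tags.length; i++) {
            startOffset[i] = tagString.length();
            tagString.append(tags[i]).append(' ');
        }

        List<String> phrases = new ArrayList<>();
        Matcher m = rule.matcher(tagString);
        while (m.find()) {
            // Map the character offsets of the match back to token indices.
            StringBuilder phrase = new StringBuilder();
            for (int i = 0; i < tags.length; i++) {
                if (startOffset[i] >= m.start() && startOffset[i] < m.end()) {
                    if (phrase.length() > 0) phrase.append(' ');
                    phrase.append(words[i]);
                }
            }
            if (phrase.length() > 0) phrases.add(phrase.toString());
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = {"Rockwell", "International", "Corp.", "'s", "Tulsa", "unit"};
        String[] tags  = {"NNP", "NNP", "NNP", "POS", "NNP", "NN"};
        // Two or more consecutive proper nouns.
        System.out.println(extract(words, tags, rule("(?:NNP ){2,}")));
        // prints [Rockwell International Corp.]
    }
}
```

Adding a rule then means adding another pattern, for example `rule("NNP CC NNP ")` for NNP -> CC -> NNP. Note that `find()` returns non-overlapping, leftmost matches, so overlapping or nested phrases are missed, and tags containing regex metacharacters (such as `PRP$`) need `Pattern.quote`. That is one way this approach can give incorrect results.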

Upvotes: 0

wcolen

Reputation: 1431

Maybe you should try using a chunker, for example the OpenNLP Chunker. It looks like you are using the same POS tagset. You can find the usage here:

http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.chunker

Input example:

Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._.

Output:

[NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ] [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ] [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ] [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._.
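
If only the noun phrases are needed, the bracketed text above can be filtered afterwards. The following is a small sketch that assumes the output is available as a single string in exactly the format shown. `NounPhraseFilter` and `nounPhrases` are hypothetical names for this example and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NounPhraseFilter {

    // Matches one "[NP ... ]" group and captures the tokens inside it.
    private static final Pattern NP_CHUNK = Pattern.compile("\\[NP ([^\\]]*)\\]");

    // Returns the words of every NP chunk, with the "_TAG" suffixes removed.
    static List<String> nounPhrases(String chunked) {
        List<String> result = new ArrayList<>();
        Matcher m = NP_CHUNK.matcher(chunked);
        while (m.find()) {
            StringBuilder phrase = new StringBuilder();
            for (String token : m.group(1).trim().split("\\s+")) {
                // The tag follows the last underscore, e.g. "Corp._NNP".
                int sep = token.lastIndexOf('_');
                String word = sep > 0 ? token.substring(0, sep) : token;
                if (phrase.length() > 0) phrase.append(' ');
                phrase.append(word);
            }
            result.add(phrase.toString());
        }
        return result;
    }
}
```

Called on the start of the output above, `[NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ] [NP it_PRP ]`, this returns the three phrases `Rockwell International Corp.`, `'s Tulsa unit` and `it`.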

Upvotes: 1
