Jules

Reputation: 15199

Paragraph breaks using Stanford CoreNLP

Is there a way to extract paragraph information from Stanford CoreNLP? I'm currently using it to extract sentences from a document, but am also interested in identifying the paragraph structure of the document, which I'd ideally like CoreNLP to do for me. I have paragraph breaks as double line breaks in my source document. I've looked through CoreNLP's javadoc, and it seems there is a ParagraphAnnotation class, but the documentation doesn't seem to specify what it contains, and I see no example anywhere of how to use it. Can anyone point me in the right direction?

For reference, my current code does something like this:

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        convertedSentences.add (new Sentence (sentence));
    }

where Sentence's constructor extracts the words from the sentence. How would I extend this to get an extra level of structure? That is, I'd like my currently document-wide 'convertedSentences' list to be supplemented by a 'convertedParagraphs' list, each entry of which contains its own 'convertedSentences' list.

I tried the approach that seemed most obvious to me:

    List<CoreMap> paragraphs = document.get(ParagraphsAnnotation.class);
    for (CoreMap paragraph : paragraphs)
    {
        List<CoreMap> sentences = paragraph.get(SentencesAnnotation.class);
        List<Sentence> convertedSentences = new ArrayList<> ();
        for (CoreMap sentence : sentences)
        {
            convertedSentences.add (new Sentence (sentence));
        }

        convertedParagraphs.add (new Paragraph (convertedSentences));
    }

but this didn't work, so I guess I misunderstand something about how this is supposed to work.

Upvotes: 4

Views: 1729

Answers (2)

Jules

Reputation: 15199

It appears that the existence of a ParagraphsAnnotation class in CoreNLP is a red herring - nothing actually uses this class (see http://grepcode.com/search/usages?type=type&id=repo1.maven.org%[email protected]%[email protected]@edu%24stanford%24nlp%[email protected]&k=u - quite literally, there are no references to this class other than its definition). Therefore, I have to break the paragraphs myself.

The key to this is to notice that each sentence contained within the SentencesAnnotation contains a CharacterOffsetBeginAnnotation. My code then becomes something like this:

    // 'text' is the raw document string that was passed to the pipeline.
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    List<Paragraph> paragraphs = new ArrayList<> ();
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        int sentenceOffsetStart = sentence.get (CharacterOffsetBeginAnnotation.class);
        // A sentence immediately preceded by a blank line starts a new paragraph.
        if (sentenceOffsetStart > 1 && text.substring (sentenceOffsetStart - 2, sentenceOffsetStart).equals("\n\n") && !convertedSentences.isEmpty ())
        {
            paragraphs.add (new Paragraph (convertedSentences));
            convertedSentences = new ArrayList<> ();
        }
        convertedSentences.add (new Sentence (sentence));
    }
    // Flush the final paragraph, if any.
    if (!convertedSentences.isEmpty ())
    {
        paragraphs.add (new Paragraph (convertedSentences));
    }
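
One limitation of the check above is that it only fires when the two characters right before the sentence are exactly "\n\n", so Windows line endings or stray spaces on the blank line would be missed. Below is a minimal sketch of a more tolerant variant that inspects the whole gap between the previous sentence's end and the current sentence's start. It is deliberately independent of CoreNLP: the class name `ParagraphGrouper`, the method `group`, and the `int[][]` span representation are hypothetical, and I'm assuming the begin/end offsets would be read from CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation on each sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP.
final class ParagraphGrouper {
    // Two newlines separated only by spaces, tabs, or carriage returns.
    private static final Pattern BLANK_LINE = Pattern.compile("\\n[ \\t\\r]*\\n");

    /**
     * Groups sentences into paragraphs.
     * spans[i] holds {begin, end} character offsets of sentence i in text,
     * in document order. Returns each paragraph as a list of sentence indices.
     */
    static List<List<Integer>> group(String text, int[][] spans) {
        List<List<Integer>> paragraphs = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int previousEnd = 0;
        for (int i = 0; i < spans.length; i++) {
            // Text between the previous sentence and this one.
            String gap = text.substring(previousEnd, spans[i][0]);
            if (!current.isEmpty() && BLANK_LINE.matcher(gap).find()) {
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(i);
            previousEnd = spans[i][1];
        }
        if (!current.isEmpty()) {
            paragraphs.add(current);
        }
        return paragraphs;
    }
}
```

The returned indices can then be mapped back onto the sentence list to build the Paragraph objects.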

Upvotes: 7

Andreas Müller

Reputation: 76

I would implement this by recognizing paragraphs with regular expressions, which should be no problem if your paragraphs are delimited by double line breaks. You can then represent a paragraph either as its own class with a single field (an ArrayList of the sentences in the paragraph) or simply as a list of sentences.
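
As a rough illustration of the regular-expression approach, here is a minimal sketch that splits raw text into paragraph strings on blank lines. The class name `ParagraphSplitter` and the method `split` are hypothetical, and the pattern assumes paragraphs are separated by two or more line breaks (optionally with spaces or tabs on the blank lines). How the sentences then get attached to each paragraph is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
final class ParagraphSplitter {
    // Two or more consecutive line breaks (LF or CRLF), allowing
    // spaces or tabs on the otherwise empty lines.
    private static final String PARAGRAPH_BREAK = "(?:\\r?\\n[ \\t]*){2,}";

    /** Splits text into trimmed, non-empty paragraph strings. */
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : text.split(PARAGRAPH_BREAK)) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }
}
```

A single line break inside a paragraph is left alone, so hard-wrapped text stays in one piece.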

Upvotes: 0
