Reputation: 15199
Is there a way to extract paragraph information from Stanford CoreNLP? I'm currently using it to extract sentences from a document, but I'm also interested in identifying the document's paragraph structure, which I'd ideally like CoreNLP to do for me. Paragraph breaks appear as double line breaks in my source document. I've looked through CoreNLP's javadoc, and there seems to be a ParagraphsAnnotation class, but the documentation doesn't specify what it contains, and I can't find an example of how to use it. Can anyone point me in the right direction?
For reference, my current code does something like this:
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
List<Sentence> convertedSentences = new ArrayList<> ();
for (CoreMap sentence : sentences)
{
    convertedSentences.add (new Sentence (sentence));
}
where Sentence's constructor extracts the words from the sentence. How would I extend this to get an extra level of data? That is, instead of a single document-wide 'convertedSentences' list, I'd like a 'convertedParagraphs' list in which each entry holds its own 'convertedSentences' list.
I tried the approach that seemed most obvious to me:
List<Paragraph> convertedParagraphs = new ArrayList<> ();
List<CoreMap> paragraphs = document.get(ParagraphsAnnotation.class);
for (CoreMap paragraph : paragraphs)
{
    List<CoreMap> sentences = paragraph.get(SentencesAnnotation.class);
    List<Sentence> convertedSentences = new ArrayList<> ();
    for (CoreMap sentence : sentences)
    {
        convertedSentences.add (new Sentence (sentence));
    }
    convertedParagraphs.add (new Paragraph (convertedSentences));
}
but this didn't work, so I guess I misunderstand something about how this is supposed to work.
Upvotes: 4
Views: 1729
Reputation: 15199
It appears that the existence of a ParagraphsAnnotation class in CoreNLP is a red herring: nothing actually uses this class (see http://grepcode.com/search/usages?type=type&id=repo1.maven.org%[email protected]%[email protected]@edu%24stanford%24nlp%[email protected]&k=u; quite literally, there are no references to this class other than its definition). Therefore, I have to break the paragraphs myself.
The key to this is to notice that each sentence contained within the SentencesAnnotation carries a CharacterOffsetBeginAnnotation. My code then becomes something like this:
// 'text' is the original document string that was passed to the pipeline.
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
List<Paragraph> paragraphs = new ArrayList<> ();
List<Sentence> convertedSentences = new ArrayList<> ();
for (CoreMap sentence : sentences)
{
    int sentenceOffsetStart = sentence.get (CharacterOffsetBeginAnnotation.class);
    // A sentence that directly follows a double line break starts a new paragraph.
    if (sentenceOffsetStart > 1 && text.substring (sentenceOffsetStart - 2, sentenceOffsetStart).equals("\n\n") && !convertedSentences.isEmpty ())
    {
        Paragraph current = new Paragraph (convertedSentences);
        paragraphs.add (current);
        convertedSentences = new ArrayList<> ();
    }
    convertedSentences.add (new Sentence (sentence));
}
// Flush the sentences of the final paragraph.
Paragraph current = new Paragraph (convertedSentences);
paragraphs.add (current);
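For anyone who wants to try the grouping logic without a CoreNLP pipeline, here is a minimal standalone sketch. It works on plain sentence start offsets (the values CharacterOffsetBeginAnnotation would give you) instead of CoreMap objects. The class and method names (ParagraphGrouper, startsParagraph, group) are made up for illustration. It is also a slight variation on the check above: rather than requiring exactly "\n\n" immediately before the sentence, it counts newlines in the whitespace run before the sentence, on the assumption that you also want to tolerate "\r\n" line endings or stray spaces between the two line breaks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class ParagraphGrouper {

    // True when the whitespace run ending just before 'start' contains
    // at least two newline characters, i.e. at least one blank line.
    static boolean startsParagraph(String text, int start) {
        int newlines = 0;
        for (int i = start - 1; i >= 0 && Character.isWhitespace(text.charAt(i)); i--) {
            if (text.charAt(i) == '\n') {
                newlines++;
            }
        }
        return newlines >= 2;
    }

    // Groups sentence start offsets into paragraphs, in document order.
    static List<List<Integer>> group(String text, List<Integer> sentenceStarts) {
        List<List<Integer>> paragraphs = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int start : sentenceStarts) {
            if (!current.isEmpty() && startsParagraph(text, start)) {
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(start);
        }
        if (!current.isEmpty()) {
            paragraphs.add(current); // flush the final paragraph
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First one. Second one.\n\nThird one.";
        // Sentence start offsets as a sentence splitter would report them.
        System.out.println(group(text, Arrays.asList(0, 11, 24)));
        // prints [[0, 11], [24]]
    }
}
```

In the real code you would keep the CoreMap (or the converted Sentence) in the inner list rather than the bare offset; the offsets are used here only to keep the example free of dependencies.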
Upvotes: 7
Reputation: 76
I would implement this by recognizing paragraphs with regular expressions, which should be no problem if they are delimited by double line breaks. You can then represent a paragraph either as its own class with just one field (an ArrayList holding the sentences in the paragraph) or simply as a list of sentences.
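A rough sketch of the regex idea, using only java.util.regex. The class name ParagraphSplitter and the exact pattern are my own assumptions: the pattern treats two or more consecutive line breaks (either "\n" or "\r\n", optionally followed by spaces or tabs) as a paragraph boundary, which may need adjusting for your input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP.
class ParagraphSplitter {

    // Two or more line breaks in a row, each optionally followed by spaces or tabs.
    private static final Pattern PARAGRAPH_BREAK =
            Pattern.compile("(?:\\r?\\n[ \\t]*){2,}");

    // Splits the text into paragraph strings, dropping empty pieces.
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String piece : PARAGRAPH_BREAK.split(text.trim())) {
            if (!piece.isEmpty()) {
                paragraphs.add(piece);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "One. Two.\n\nThree.\r\n\r\n\r\nFour.";
        for (String paragraph : split(text)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

Each paragraph string could then be annotated on its own to get its sentences. Note that sentence character offsets would then be relative to the paragraph string, not to the whole document, so this suits cases where document-level offsets are not needed.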
Upvotes: 0