I would like to extract the spans of a tokenized String of text. Using Stanford's CoreNLP, I have:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
this.pipeline = new StanfordCoreNLP(props);

String answerText = "This is the answer";
ArrayList<IntPair> tokenSpans = new ArrayList<IntPair>();

// create an empty Annotation with just the given text
Annotation document = new Annotation(answerText);
// run all Annotators on this text
this.pipeline.annotate(document);

// Iterate over all of the sentences
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // Iterate over all tokens in a sentence
    for (CoreLabel fullToken : sentence.get(TokensAnnotation.class)) {
        IntPair span = fullToken.get(SpanAnnotation.class);
        tokenSpans.add(span);
    }
}
However, all of the IntPairs are null. Do I need to add another annotator in the line:
props.put("annotators", "tokenize, ssplit, pos, lemma");
Desired Output:
(0,3), (5,6), (8,10), (12,17)
Upvotes: 2
The problem was using SpanAnnotation, which applies to Trees. The correct classes for this query are CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation.
For example, they can be used like so:
// tokenSeq is assumed to be a List<CoreLabel>, e.g. sentence.get(TokensAnnotation.class)
List<IntPair> spans = tokenSeq.stream()
    .map(token -> new IntPair(
        token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class),
        token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class)))
    .collect(Collectors.toList());
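One thing to watch for: as far as I know, the character offsets are half-open, so the end offset points one past the token's last character. For "This is the answer" that would give (0,4), (5,7), (8,11), (12,18), not the inclusive (0,3), (5,6), ... from the question. Below is a minimal stdlib-only sketch of the conversion. Note that `TokenSpans`, `halfOpenOffsets` and `inclusiveSpans` are hypothetical names, and the whitespace regex is only a stand-in for the CoreNLP tokenizer, not what CoreNLP actually does:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenSpans {
    // Assumption: a plain whitespace split stands in for the CoreNLP tokenizer.
    // Returns half-open [begin, end) character offsets, one pair per token.
    static List<int[]> halfOpenOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            offsets.add(new int[] {m.start(), m.end()});
        }
        return offsets;
    }

    // Converts the half-open offsets to inclusive (begin,end) spans by
    // subtracting one from each end offset, then formats them as text.
    static String inclusiveSpans(String text) {
        StringBuilder sb = new StringBuilder();
        for (int[] o : halfOpenOffsets(text)) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append('(').append(o[0]).append(',').append(o[1] - 1).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints (0,3), (5,6), (8,10), (12,17)
        System.out.println(inclusiveSpans("This is the answer"));
    }
}
```

With real CoreLabels the same idea should apply: subtract one from the CharacterOffsetEndAnnotation value when building each IntPair if you want inclusive end positions.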
Upvotes: 2