tradt

Reputation: 15

Coreference resolution using Stanford CoreNLP

I am new to the Stanford CoreNLP toolkit and am trying to use it in a project to resolve coreferences in news texts. To use the Stanford CoreNLP coreference system, we would usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition and parsing. For example:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = "As competition heats up in Spain's crowded bank market, Banco Exterior de Espana is seeking to shed its image of a state-owned bank and move into new activities.";

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

Then we can easily get the sentence annotations with:

List<CoreMap> sentences = document.get(SentencesAnnotation.class);

However, I am using other tools for preprocessing and just need a stand-alone coreference resolution system. It is pretty easy to create the tokens and parse tree annotations and set them on the annotation:

// create new annotation
Annotation annotation = new Annotation();

// create token annotations for each sentence from the input file
List<CoreLabel> tokens = new ArrayList<>();
for(int tokenCount = 0; tokenCount < parsedSentence.size(); tokenCount++) {

        ArrayList<String> parsedLine = parsedSentence.get(tokenCount);
        String word = parsedLine.get(1);
        String lemma = parsedLine.get(2);
        String posTag = parsedLine.get(3);
        String namedEntity = parsedLine.get(4); 
        String partOfParseTree = parsedLine.get(6);

        CoreLabel token = new CoreLabel();
        token.setWord(word);
        token.setLemma(lemma);
        token.setTag(posTag);
        token.setNER(namedEntity);
        tokens.add(token);
    }

// set tokens annotations to annotation
annotation.set(TokensAnnotation.class, tokens);

// set parse tree annotations to annotation
Tree stanfordParseTree = Tree.valueOf(inputParseTree);
annotation.set(TreeAnnotation.class, stanfordParseTree);
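The snippet above reads a per-token partOfParseTree but passes a whole-sentence string, inputParseTree, to Tree.valueOf. One possible way to bridge the two is sketched below. It assumes the parse column holds CoNLL-style parse bits, where "*" marks the position of the token; the question does not confirm this format. The class and method names (ParseBitJoiner, joinParseBits) are hypothetical and not part of CoreNLP.

```java
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
final class ParseBitJoiner {

    /**
     * Builds one bracketed tree string for a sentence.
     * Assumption: each parse bit is CoNLL-style, e.g. "(S(NP*)", where '*'
     * stands for the token and is replaced by "(POS word)".
     */
    static String joinParseBits(List<String> words,
                                List<String> posTags,
                                List<String> parseBits) {
        StringBuilder tree = new StringBuilder();
        for (int i = 0; i < parseBits.size(); i++) {
            // leaf node for the current token
            String leaf = "(" + posTags.get(i) + " " + words.get(i) + ")";
            tree.append(parseBits.get(i).replace("*", leaf));
        }
        return tree.toString();
    }
}
```

The resulting string could then serve as the input to Tree.valueOf. Tokens that are themselves brackets would likely need escaping (for example as -LRB- and -RRB-) before being inserted.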

However, creating sentence annotations is pretty tricky, because as far as I can tell no documentation explains it in full detail. I am able to create the data structure for the sentence annotations and set it on the annotation:

List<CoreMap> sentences = new ArrayList<CoreMap>();
annotation.set(SentencesAnnotation.class, sentences);

I am sure it cannot be that difficult, but there is no documentation on how to create sentence annotation from tokens annotations, i.e. how to fill the ArrayList with actual sentence annotations.

Any ideas?

Btw, if I take the tokens and parse tree annotations from my own preprocessing tools, take only the sentence annotations from the StanfordCoreNLP pipeline, and then apply the stand-alone coreference resolution system, I get the correct results. So the only missing piece for a complete stand-alone coreference resolution system is creating the sentence annotations from the tokens annotations.

Upvotes: 1

Views: 2037

Answers (1)

Sebastian Schuster

Reputation: 1563

There is an Annotation constructor with a List<CoreMap> sentences argument which sets up the document if you have a list of already tokenized sentences.

For each sentence, create a CoreMap object as follows. (Note that I also added a sentence index to each sentence object and a token index to each token object.)

int sentenceIdx = 1;
List<CoreMap> sentences = new ArrayList<CoreMap>();
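// parsedSentences holds one list of token rows per sentence
// (element type assumed from how parsedSentence is used below)
for (List<ArrayList<String>> parsedSentence : parsedSentences) {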
    CoreMap sentence = new CoreLabel();
    List<CoreLabel> tokens = new ArrayList<>();
    for(int tokenCount = 0; tokenCount < parsedSentence.size(); tokenCount++) {

        ArrayList<String> parsedLine = parsedSentence.get(tokenCount);
        String word = parsedLine.get(1);
        String lemma = parsedLine.get(2);
        String posTag = parsedLine.get(3);
        String namedEntity = parsedLine.get(4); 
        String partOfParseTree = parsedLine.get(6);

        CoreLabel token = new CoreLabel();
        token.setWord(word);
        token.setLemma(lemma);
        token.setTag(posTag);
        token.setNER(namedEntity);
        token.setIndex(tokenCount + 1);
        tokens.add(token);
    }

    // set tokens annotations and id of sentence 
    sentence.set(TokensAnnotation.class, tokens);
    sentence.set(SentenceIndexAnnotation.class, sentenceIdx++);

    // set parse tree annotations to annotation
    Tree stanfordParseTree = Tree.valueOf(inputParseTree);
    sentence.set(TreeAnnotation.class, stanfordParseTree);

    // add sentence to list of sentences
    sentences.add(sentence);
}

Then you can create an Annotation instance with the sentences list:

Annotation annotation = new Annotation(sentences);
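The loop above assumes that parsedSentences already holds one list of token rows per sentence. A minimal sketch of building that structure in plain Java follows. It assumes the preprocessed input has one token per line with tab-separated columns and a blank line between sentences; the question does not state the file format, so adjust accordingly. The class and method names (SentenceGrouper, groupIntoSentences) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
final class SentenceGrouper {

    /**
     * Groups raw input lines into sentences.
     * Assumptions: columns are tab-separated and a blank line ends a sentence.
     */
    static List<List<ArrayList<String>>> groupIntoSentences(List<String> lines) {
        List<List<ArrayList<String>>> parsedSentences = new ArrayList<>();
        List<ArrayList<String>> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // blank line: close the current sentence, if it has any tokens
                if (!current.isEmpty()) {
                    parsedSentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                // limit -1 keeps empty trailing columns
                current.add(new ArrayList<>(Arrays.asList(line.split("\t", -1))));
            }
        }
        // the last sentence may not be followed by a blank line
        if (!current.isEmpty()) {
            parsedSentences.add(current);
        }
        return parsedSentences;
    }
}
```

Each inner list can then be iterated exactly like parsedSentence in the loop above, with one row of columns per token.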

Upvotes: 5
