Ziqi

Reputation: 2554

stanford-nlp NER from a list of tokens

Is there a way to use the Stanford NER library that takes a list of tokens as input and extracts NEs?

I have checked the API, but it is not explicit. Most of the time the input is a String or a document; in both cases tokenization is done behind the scenes.

In my case, I really have to do tokenization beforehand and pass the list of tokens to the API. I have noticed that I can do:

List<HasWord> words = new ArrayList<>();
words.add(new Word("Tesco"));
// ..... adding elements to words

List<CoreLabel> labels = classifier.classifySentence(words);

Is this correct?

Many thanks!!

Upvotes: 2

Views: 915

Answers (2)

StanfordNLPHelp

Reputation: 8739

Here is one way to solve this issue:

import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;

public class NERPreToken {
    public static void main (String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators",
            "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("tokenize.whitespace", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String [] tokensArray = {"Stephen","Colbert","hosts","a","show","on","CBS","."};
        List<String> tokensList = Arrays.asList(tokensArray);
        String docString = String.join(" ",tokensList);
        Annotation annotation = new Annotation(docString);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
            for (CoreLabel token : tokens) {
                System.out.println(token.word()+" "+token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
            }
        }
    }
}

The key here is to start with your list of tokens and set the pipeline's tokenize property so that it splits only on whitespace (`tokenize.whitespace=true`). Then submit a String with your tokens joined by single spaces; the pipeline will recover exactly the tokens you started with.
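The joining step itself needs nothing from CoreNLP, so it can be illustrated in plain Java. A small sketch (class name is illustrative); note the caveat in the comment: this round-trips cleanly only if no token itself contains whitespace.

```java
import java.util.Arrays;
import java.util.List;

public class JoinTokens {
    public static void main(String[] args) {
        // Pre-tokenized input, e.g. produced by your own tokenizer.
        List<String> tokens = Arrays.asList(
            "Stephen", "Colbert", "hosts", "a", "show", "on", "CBS", ".");

        // Join with single spaces. With tokenize.whitespace=true the
        // pipeline splits this string back into exactly these tokens,
        // provided no individual token contains whitespace itself.
        String docString = String.join(" ", tokens);

        System.out.println(docString);
        // Stephen Colbert hosts a show on CBS .
    }
}
```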

Upvotes: 2

John Wiseman

Reputation: 3137

You can use the Sentence.toCoreLabelList method:

// `classifier` is assumed to be an already-loaded NER classifier, e.g.
// CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz")
String[] token_strs = {"John", "met", "Amy", "in", "Los", "Angeles"};
List<CoreLabel> tokens = edu.stanford.nlp.ling.Sentence.toCoreLabelList(token_strs);
for (CoreLabel cl : classifier.classifySentence(tokens)) {
  System.out.println(cl.toShorterString());
}

Output:

[Value=John Text=John Position=0 Answer=PERSON Shape=Xxxx DistSim=463]
[Value=met Text=met Position=1 Answer=O Shape=xxxk DistSim=476]
[Value=Amy Text=Amy Position=2 Answer=PERSON Shape=Xxx DistSim=396]
[Value=in Text=in Position=3 Answer=O Shape=xxk DistSim=510]
[Value=Los Text=Los Position=4 Answer=LOCATION Shape=Xxx DistSim=449]
[Value=Angeles Text=Angeles Position=5 Answer=LOCATION Shape=Xxxxx DistSim=199]

Upvotes: 2
