RVT

Reputation: 39

Stanford NER Classifier linefeed issue

I'm using the Stanford NER with a 3 class model to identify PERSON, LOCATION, and ORGANIZATION in a file. It works fine except when there are names separated by a newline:
JANE DOE
JOHN DOE
JANE SMITH

The NER tool treats these three names as one big name rather than three separate names. If I put a comma after each name, it picks up the three names. How can I tell the tool to use the newline to separate the three names?

Upvotes: 1

Views: 147

Answers (1)

Christopher Manning

Reputation: 9450

If the names end up as successive tokens in the same "sentence", the classifier will treat them as one entity. The main thing you can do is have the system treat newlines as sentence breaks; you then get a separate sentence for each name, and things work fine. In general, this works well if your text is formatted as one paragraph per line (with soft line-wrapping, as is usual in modern text), but badly if your text has hard line breaks that do not fall at sentence or paragraph boundaries, because the system will then wrongly treat each line as a sentence. Commands that do this, first through CoreNLP and then by calling Stanford NER directly, are:

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators "tokenize,ssplit,pos,lemma,ner" -file taylorswift.txt -outputFormat conll -ssplit.newlineIsSentenceBreak always

java edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz -textFile taylorswift.txt -tokenizerOptions tokenizeNLs=true
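
To see why this helps the list of names but hurts hard-wrapped text, here is a minimal sketch of what "treat every newline as a sentence break" does to the input. It uses only the Java standard library and does not call Stanford NER; the class name `NewlineSplitSketch` and the method `splitOnNewlines` are hypothetical names made up for illustration, and the real tokenizer and sentence splitter do considerably more than this.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineSplitSketch {

    // Hypothetical helper: returns one "sentence" per non-empty line,
    // mimicking the effect of always breaking sentences at newlines.
    static List<String> splitOnNewlines(String text) {
        List<String> sentences = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...).
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // One name per line: each name becomes its own sentence,
        // so adjacent names can no longer merge into one entity.
        String names = "JANE DOE\nJOHN DOE\nJANE SMITH\n";
        System.out.println(splitOnNewlines(names));

        // Hard-wrapped prose: the break falls inside a name, so
        // "John" and "Doe" land in different sentences.
        String wrapped = "Jane Doe met John\nDoe in Palo Alto.";
        System.out.println(splitOnNewlines(wrapped));
    }
}
```

The second input shows the failure mode described above: with a hard line break inside a sentence, a name that spans the break is cut in two before the classifier ever sees it.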

Upvotes: 1
