Mohankumar

Reputation: 43

Stanford NER, Output encoding issue

I am using Stanford NER 3.6.0 to identify person names. I have no problem generating XML output from either an input text file or an input XML file.

The problem is reading the XML file returned by NER.

The two errors I am getting are:

  1. Name cannot begin with the ' ' character, hexadecimal value 0xA0.

  2. Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.

I am generating the XML output by running the JAR file from the command prompt.

Command line:

java -mx1000m -cp "D:/Downloads/Projects/Installations/stanford-ner-2015-12-09/stanford-ner.jar;D:/Downloads/Projects/Installations/stanford-ner-2015-12-09/lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier "D:/Downloads/Projects/Installations/stanford-ner-2015-12-09/classifiers/english.conll.4class.distsim.crf.ser.gz" -outputFormat inlineXML -textFile "C:\Users\Freeware Sys\AppData\Local\Temp\References (2)_in.txt" > "C:\Users\Freeware Sys\AppData\Local\Temp\References (2)_ner.xml" -inputEncoding "UTF-8" -outputEncoding "UTF-8"

Any help would be much appreciated.

Thanks.

Upvotes: 0

Views: 370

Answers (1)

Christopher Manning

Reputation: 9450

I guess we have been overclaiming/misleading with the name "inlineXML". In practice this simply means that Stanford NER outputs XML-style tags around entities. It has never meant that it produces a valid XML document as output. We could change that, but we'd probably produce something different, since it doesn't make much sense to have a different tag per entity type for real XML parsing.
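Since the output is plain text with tags around entities, one option is to pull the entities out with a regular expression instead of handing the file to an XML parser. The following is a minimal sketch, not part of Stanford NER: the class name `InlineXmlEntities`, the `extract` helper, and the sample sentence are hypothetical, and it assumes entity labels are uppercase words such as PERSON and that tags are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineXmlEntities {
    // Matches <TYPE>text</TYPE>, where TYPE is an uppercase label such as PERSON.
    // The back-reference \1 requires the closing tag to repeat the opening label.
    private static final Pattern ENTITY =
            Pattern.compile("<([A-Z]+)>(.*?)</\\1>", Pattern.DOTALL);

    /** Returns each tagged entity as a two-element array: {type, text}. */
    public static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] {m.group(1), m.group(2)});
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hypothetical sample in the inlineXML style (assumption, not real NER output).
        String sample = "<PERSON>Christopher Manning</PERSON> works at "
                + "<ORGANIZATION>Stanford</ORGANIZATION>.";
        for (String[] e : extract(sample)) {
            System.out.println(e[0] + "\t" + e[1]);
        }
    }
}
```

Because nothing is parsed as XML, characters such as a bare `&` or a leftover XML declaration in the surrounding text do not cause errors; they are simply ignored unless they fall inside a tagged entity.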

If you want a well-formed XML document, try CoreNLP's xml output format:

java -mx1g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators "tokenize,ssplit,pos,lemma,ner" -ner.model edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz -ner.useSUTime false -outputFormat xml -file foo.txt -encoding "UTF-8"

Why are the non-breaking space characters a problem? They are deliberately used in Stanford NLP code in the rare cases (like phone numbers) where spaces are allowed inside single tokens. They are valid in an XML document coded in UTF-8.
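To check that point independently of Stanford NER, here is a small sketch using the JDK's built-in DOM parser. The class name `NbspInXml`, the `rootText` helper, and the phone-number sample are made up for illustration. It only shows that U+00A0 inside element text is accepted in a UTF-8 document; the error quoted in the question reads as if the character was met where an element name was expected, which is a different situation and is my assumption rather than something the question confirms.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class NbspInXml {
    /** Parses a UTF-8 encoded XML string and returns the root element's text content. */
    public static String rootText(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical token containing non-breaking spaces, like a phone number.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<token>650\u00A0723\u00A01234</token>";
        String text = rootText(xml);
        // Reports whether the non-breaking space survived parsing.
        System.out.println(text.indexOf('\u00A0') >= 0);
    }
}
```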

Upvotes: 1
