Amir

Reputation: 1979

Different result in StanfordNERTagger in python3.5 - Stanford-ner-2015-12-09

I tried to run a sample sentence:

from nltk.tag import StanfordNERTagger
_model_filename = r'D:/standford/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz'

_path_to_jar = r'D:/standford/stanford-ner-2015-12-09/stanford-ner.jar'

st = StanfordNERTagger(model_filename=_model_filename, path_to_jar=_path_to_jar)

st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) 

In Python, my output was:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

However, I expected NY to be tagged as a LOCATION as well, based on this reference.

I tried another example as below:

st.tag('Ali is living in London.'.split())

The result, shown below, was correct:

[('Ali', 'PERSON'), ('is', 'O'), ('living', 'O'), ('in', 'O'), ('London.', 'LOCATION')]

Do you have any idea why it didn't recognize NY as a location in the first sentence?

I am using Visual Studio 2015, Python 3.5, and Stanford-ner-2015-12-09.

Upvotes: 0

Views: 333

Answers (1)

alvas

Reputation: 122122

The Stanford NER tool is trained on properly formatted news text, so punctuation is quite important. From the docs:

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data.

From the CoNLL 2003 doc:

The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done by people of the University of Antwerp. Because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST.

If you add the full stop to the example sentence (as its own token, as in the session below), you should get your desired output, but still no model is perfect =)

alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner-2015-12-09/stanford-ner.jar
alvas@ubi:~$ export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-ner-2015-12-09/classifiers
alvas@ubi:~$ python3
Python 3.5.2 (default, Jul  5 2016, 12:43:10) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.tag import StanfordNERTagger
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
>>> sent = 'Rami Eid is studying at Stony Brook University in NY .'.split()
>>> st.tag(sent)
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION'), ('.', 'O')]
>>> sent = 'Rami Eid is studying at Stony Brook University in NY'.split()
>>> st.tag(sent)
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
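
If you want to apply this automatically, here is a minimal sketch of a pre-processing step. It assumes simple whitespace tokenization like in the question; `prepare_tokens` is a hypothetical helper of my own, not part of NLTK or Stanford NER. It makes sure the token list ends with a sentence-final punctuation mark as a separate token, which is the form that gave the LOCATION tag for NY above. It is naive (a trailing abbreviation such as "U.S." would be split too) and it does not guarantee better tags for every sentence.

```python
# Hypothetical helper, for illustration only (not an NLTK or Stanford NER API).
_TERMINALS = {'.', '!', '?'}


def prepare_tokens(sentence):
    """Split on whitespace and make sure the result ends with a
    sentence-final punctuation mark as its own token."""
    tokens = sentence.split()
    if not tokens:
        return tokens
    last = tokens[-1]
    # Already ends with a standalone punctuation token, e.g. 'NY .'
    if last in _TERMINALS:
        return tokens
    # Punctuation glued to the last word, e.g. 'London.' -> 'London', '.'
    if len(last) > 1 and last[-1] in _TERMINALS:
        return tokens[:-1] + [last[:-1], last[-1]]
    # No terminal punctuation at all: append a full stop.
    return tokens + ['.']


print(prepare_tokens('Rami Eid is studying at Stony Brook University in NY'))
print(prepare_tokens('Ali is living in London.'))
```

With the tagger from the session above you would then call `st.tag(prepare_tokens(text))` instead of `st.tag(text.split())`.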

Upvotes: 1
