m.rel

Reputation: 51

nltk StanfordNERTagger: tag and tag_sents give different results

I essentially want to use the nltk StanfordNERTagger to purify a list of names (e.g. there are organizations in there that I want to remove), and I stumbled on a weird issue. The tags assigned to one sentence seem to depend on which other sentences are passed along with it, which isn't very intuitive.

Here is how to reproduce:

from nltk.tag import StanfordNERTagger
tagger = StanfordNERTagger('/path/to/english.all.3class.distsim.crf.ser.gz',
                           '/path/to/stanford-ner-2017-06-09/stanford-ner.jar',
                           encoding='utf-8')
things_to_tag = ["Star Trek".split(),
                 "Star Jones".split(),
                 "Star Wars".split()]

# tagging using tag_sents
print tagger.tag_sents( things_to_tag )

# tagging using tag
for t in things_to_tag:
    print tagger.tag(t)

Output:

[[(u'Star', u'ORGANIZATION'), (u'Trek', u'ORGANIZATION')],
[(u'Star', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION')],
[(u'Star', u'ORGANIZATION'), (u'Wars', u'ORGANIZATION')]]

[(u'Star', u'O'), (u'Trek', u'O')]
[(u'Star', u'PERSON'), (u'Jones', u'PERSON')]
[(u'Star', u'O'), (u'Wars', u'O')]

I also tried removing "Star Wars" from the list, and the results change again ('Trek' becomes PERSON, and 'Star' becomes O).

I looked into nltk/tag/stanford.py and it's not really clear why this would happen. I was hoping someone could lend a hand in identifying what the issue might be, or at least confirm I'm not the only one seeing this.

nltk version: 3.2.5
python version: 2.7.13

Upvotes: 3

Views: 260

Answers (1)

m.rel

Reputation: 51

OK, so it comes down to the tokenizeNLs tokenizer option. If it is left as false, the input is treated as one giant string, which means the predicted tags for each sentence depend on everything else in that string. In my view, this is wrong. Changing it to true and removing the escaped quotes gives me the desired output.
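To make the effect concrete, here is a rough simulation, not the Stanford tokenizer itself. It assumes that tag_sents joins the sentences with newlines into a single input, and that a whitespace tokenizer with tokenizeNLs=false treats those newlines as ordinary whitespace. The function names are made up for illustration.

```python
def join_for_tagger(sentences):
    # Assumption: mirrors how tag_sents builds its input, one sentence per line.
    return "\n".join(" ".join(tokens) for tokens in sentences)


def simulate_sequences(text, tokenize_nls):
    # Hypothetical stand-in for the whitespace tokenizer, for illustration only.
    if tokenize_nls:
        # Newlines are kept as boundaries: one token sequence per line.
        return [line.split() for line in text.split("\n")]
    # Newlines count as plain whitespace: everything becomes one sequence.
    return [text.split()]


things_to_tag = ["Star Trek".split(),
                 "Star Jones".split(),
                 "Star Wars".split()]
text = join_for_tagger(things_to_tag)

merged = simulate_sequences(text, tokenize_nls=False)
separate = simulate_sequences(text, tokenize_nls=True)

print(merged)    # a single six-token sequence
print(separate)  # three two-token sequences
```

In the merged case the classifier would see "Star Trek Star Jones Star Wars" as one sequence, which would explain why adding or removing an entry changes the tags of the others.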

To be extra clear, in nltk/tag/stanford.py modify: '\"tokenizeNLs=false\"' --> 'tokenizeNLs=true'
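If you would rather not edit the installed nltk file, a subclass could apply the same change. This is only a sketch: it assumes that StanfordNERTagger builds its Java command in a `_cmd` property (as it appears to in nltk/tag/stanford.py for 3.2.5) and that the option value directly follows a '-tokenizerOptions' argument. The helper and class names are hypothetical.

```python
def enable_newline_tokens(cmd):
    """Return a copy of a command list with the tokenizer option set to true."""
    patched = list(cmd)  # copy, so the original list is left untouched
    for i, arg in enumerate(patched[:-1]):
        if arg == '-tokenizerOptions':
            # Replace the quoted 'false' value with an unquoted 'true' one.
            patched[i + 1] = 'tokenizeNLs=true'
    return patched


try:
    from nltk.tag import StanfordNERTagger

    class NewlineAwareNERTagger(StanfordNERTagger):
        # Assumption: the parent class exposes its Java command as `_cmd`.
        @property
        def _cmd(self):
            parent_cmd = super(NewlineAwareNERTagger, self)._cmd
            return enable_newline_tokens(parent_cmd)
except ImportError:
    # nltk is not installed; only the helper above is available.
    pass
```

It would then be constructed with the same model and jar paths as in the question, e.g. NewlineAwareNERTagger('/path/to/english.all.3class.distsim.crf.ser.gz', '/path/to/stanford-ner-2017-06-09/stanford-ner.jar', encoding='utf-8'). I have only tested the direct edit, not this subclass.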

Upvotes: 2
