Reputation: 16317
I tried named entity recognition (NER) on the following sentence with both NLTK and spaCy. The results are below:
"Zoni I want to find a pencil, a eraser and a sharpener"
I ran the following code on Google Colab.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
ex = "Zoni I want to find a pencil, a eraser and a sharpener"
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent
sent = preprocess(ex)
sent
#Output:
[('Zoni', 'NNP'),
('I', 'PRP'),
('want', 'VBP'),
('to', 'TO'),
('find', 'VB'),
('a', 'DT'),
('pencil', 'NN'),
(',', ','),
('a', 'DT'),
('eraser', 'NN'),
('and', 'CC'),
('a', 'DT'),
('sharpener', 'NN')]
But when I used spaCy on the same text, it didn't return any result:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
text = "Zoni I want to find a pencil, a eraser and a sharpener"
doc = nlp(text)
doc.ents
#Output:
()
It only works for some sentences:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
# text = "Zoni I want to find a pencil, a eraser and a sharpener"
text = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
doc = nlp(text)
doc.ents
#Output:
(European, Google, $5.1 billion, Wednesday)
Please let me know if there is something wrong.
Upvotes: 2
Views: 1479
Reputation: 3096
I'm not sure I understand the comparison you're trying to make. In your first example, with NLTK, you're looking at the POS tags in the sentence. In the second example, with spaCy, you're looking at the named entities. These are two different things. The statistical models should always give you a POS tag for every token (though the tag may differ from one model to another), but the recognition of named entities (as explained in the post by 'Life is complex') depends on the data sets that these models were trained on. If a model "feels like" there is no named entity in the sentence, you'll get an empty result set. To get a fair comparison, you should also show the named entities found by NLTK and compare against those.
If instead you wanted to compare POS tags, with spaCy you can run this:
for token in doc:
    print(token.text, token.pos_, token.tag_)
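For the other direction, here is a rough sketch of how you could get named entities out of NLTK so that you compare like with like. It assumes `nltk.ne_chunk` is run on the POS-tagged tokens. The function names `named_entities` and `nltk_named_entities` are my own, and the resource names passed to `nltk.download` are an assumption that fits older NLTK releases (newer ones may ask for differently named packages).

```python
def named_entities(chunked):
    """Collect (label, text) pairs from the output of nltk.ne_chunk.

    ne_chunk returns a tree whose children are either plain (word, tag)
    tuples or labelled subtrees; only the subtrees are named entities.
    """
    entities = []
    for node in chunked:
        if hasattr(node, "label"):  # a subtree, i.e. a recognized entity
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


def nltk_named_entities(sentence):
    """Tokenize, POS-tag and NE-chunk a sentence with NLTK."""
    import nltk  # imported here so the helper above has no dependencies

    # Assumption: these resource names match older NLTK releases;
    # adjust them if your version reports a missing resource.
    for resource in ("punkt", "averaged_perceptron_tagger",
                     "maxent_ne_chunker", "words"):
        nltk.download(resource, quiet=True)

    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return named_entities(nltk.ne_chunk(tagged))
```

Calling `nltk_named_entities("Zoni I want to find a pencil, a eraser and a sharpener")` then gives you a list you can put next to `doc.ents` from spaCy. An empty list here would mean NLTK found no entities either.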
Upvotes: 3
Reputation: 15619
spaCy models are statistical, so the named entities that these models recognize depend on the data sets they were trained on.
According to the spaCy documentation, a named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title.
For example, the name Zoni is not common, so the model doesn't recognize it as a named entity (person). If I change the name Zoni to William in your sentence, spaCy recognizes William as a person.
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp('William I want to find a pencil, a eraser and a sharpener')
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
#output
PERSON | William
One would assume that pencil, eraser and sharpener are objects, so they could potentially be classified as products, because the spaCy documentation describes products as 'objects'. But that does not seem to be the case for the three objects in your sentence.
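If you want to check what the entity labels of a model are meant to cover, something along these lines should work. It is only a sketch: `describe_labels` and `show_model_labels` are names I made up, and it assumes the loaded pipeline has a component called "ner" (the pretrained English models do) and that `spacy.explain` has a glossary entry for each label.

```python
def describe_labels(labels, explain):
    """Map each label to its description, using the given lookup function."""
    descriptions = {}
    for label in labels:
        # The lookup may return None for labels it does not know.
        descriptions[label] = explain(label) or "(no description available)"
    return descriptions


def show_model_labels(model_name="en_core_web_sm"):
    """Print every NER label of a spaCy model with its glossary text."""
    import spacy  # imported here so describe_labels stays dependency-free

    nlp = spacy.load(model_name)
    ner_labels = nlp.get_pipe("ner").labels  # labels the NER component can emit
    for label, description in describe_labels(ner_labels, spacy.explain).items():
        print(label, "|", description)
```

Running `show_model_labels()` lists the labels together with their descriptions, which makes it easier to judge whether everyday items like a pencil would be expected under a given label at all.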
I also noticed that if no named entities are found in the input text, the output is empty. You can check for that explicitly:
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('Zoni I want to find a pencil, a eraser and a sharpener')
if not doc.ents:
    print('No named entities were recognized in the input text.')
else:
    for entity in doc.ents:
        print(entity.label_, ' | ', entity.text)
Upvotes: 3