Reputation: 1151
Why don't spaCy transformer models for languages like Spanish (es_dep_news_trf) do named entity recognition, while the English model (en_core_web_trf) does?
In code:
import spacy
nlp=spacy.load("en_core_web_trf")
doc=nlp("my name is John Smith and I work at Apple and I like visiting the Eiffel Tower")
print(doc.ents)
(John Smith, Apple, the Eiffel Tower)
nlp=spacy.load("es_dep_news_trf")
doc=nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
print(doc.ents)
()
Why does the English model extract entities while the Spanish one does not?
Upvotes: 0
Views: 890
Reputation: 2615
The spaCy models vary with regard to which NLP features they provide; this is simply a result of how the respective authors created and trained them. For example, https://spacy.io/models/en#en_core_web_trf lists "ner" in its components, but https://spacy.io/models/es#es_dep_news_trf does not.
The Spanish https://spacy.io/models/es#es_core_news_lg (as well as the two smaller variants) does list "ner" in its components, so those pipelines return named entities:
>>> import spacy
>>> nlp=spacy.load("es_core_news_sm")
>>> doc=nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
>>> print(doc.ents)
(John Smith, Apple, Torre Eiffel)
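Rather than reading the model pages, you can also check a loaded pipeline directly through its `pipe_names` attribute. The following is a minimal sketch; `has_component` and `report` are hypothetical helper names of my own, and `report` assumes the three listed packages are installed:

```python
def has_component(nlp, name="ner"):
    """Return True if the loaded pipeline has a component called `name`."""
    return name in nlp.pipe_names


def report(model_names=("en_core_web_trf", "es_dep_news_trf", "es_core_news_sm")):
    """Print the components of each pipeline (hypothetical helper)."""
    # Imported lazily so has_component() can be used without loading spaCy here.
    import spacy

    for model_name in model_names:
        # spacy.load raises OSError if the model package is not installed.
        nlp = spacy.load(model_name)
        print(model_name, nlp.pipe_names, "NER:", has_component(nlp))
```

Calling `report()` should show "ner" for en_core_web_trf and es_core_news_sm but not for es_dep_news_trf, which matches the empty `doc.ents` in the question.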
Upvotes: 1
Reputation: 11484
It has to do with the available training data. ner is only included for the trf models if the training data has NER annotation on the exact same data as for tagging and parsing.
Training trf models on partial annotation does not work well in practice, and an independent NER component (as in the CNN pipelines) would mean including an additional transformer component in the pipeline, which would make the pipeline a lot larger and slower.
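If you still want entities alongside the transformer-based tagging and parsing, one possible workaround is to source the independent `ner` component from a CNN pipeline with `nlp.add_pipe(..., source=...)`. This is only a sketch under assumptions: the helper names are hypothetical, both packages must be installed, and the sourced `ner` keeps its own CNN tok2vec, so its accuracy is that of es_core_news_sm rather than of a transformer.

```python
def add_independent_ner(target_nlp, source_nlp, name="ner"):
    """Copy the trained `name` component from source_nlp into target_nlp."""
    if name in target_nlp.pipe_names:
        # Nothing to do: the target pipeline already has this component.
        return target_nlp
    # add_pipe with source= reuses the trained component from another pipeline.
    target_nlp.add_pipe(name, source=source_nlp)
    return target_nlp


def build_spanish_trf_with_ner():
    """Hypothetical helper: es_dep_news_trf plus the CNN ner component."""
    # Imported lazily; both model packages are assumed to be installed.
    import spacy

    trf = spacy.load("es_dep_news_trf")
    cnn = spacy.load("es_core_news_sm")
    return add_independent_ner(trf, cnn)
```

This avoids a second transformer in the pipeline, at the cost of NER quality compared with a jointly trained trf model such as en_core_web_trf.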
Upvotes: 1