Daniel Wyatt

Reputation: 1151

Why don't spaCy transformer models for non-English languages do NER?

Why is it that spaCy transformer models for languages like Spanish (es_dep_news_trf) don't do named entity recognition?

However, the English transformer model (en_core_web_trf) does.

In code:

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("my name is John Smith and I work at Apple and I like visiting the Eiffel Tower")
print(doc.ents)
# (John Smith, Apple, the Eiffel Tower)

nlp = spacy.load("es_dep_news_trf")
doc = nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
print(doc.ents)
# ()

Why does the Spanish model extract no entities while the English model does?

Upvotes: 0

Views: 890

Answers (2)

Hans

Reputation: 2615

The spaCy models vary with regard to which NLP features they provide; this is simply a result of how the respective authors created/trained them. For example, https://spacy.io/models/en#en_core_web_trf lists "ner" among its components, but https://spacy.io/models/es#es_dep_news_trf does not.
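You can also confirm this from a loaded pipeline itself: pipe_names lists the components. A quick sketch (the exact lists depend on the model versions you have installed):

import spacy

print(spacy.load("en_core_web_trf").pipe_names)   # includes "ner"
print(spacy.load("es_dep_news_trf").pipe_names)   # no "ner" in this list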

The Spanish https://spacy.io/models/es#es_core_news_lg (as well as the two smaller variants) does list "ner" in its components, so those pipelines do produce named entities:

>>> import spacy  
>>> nlp=spacy.load("es_core_news_sm")
>>> doc=nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
>>> print(doc.ents)
(John Smith, Apple, Torre Eiffel)

Upvotes: 1

aab

Reputation: 11484

It has to do with the available training data: ner is only included in the trf models if the training data has NER annotation on the exact same data as for tagging and parsing.

Training trf models on partial annotation does not work well in practice, and an independent NER component (as in the CNN pipelines) would mean including an additional transformer in the pipeline, which would make it a lot larger and slower.
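If you need both transformer-based tagging/parsing and Spanish NER, one possible workaround (just a sketch, not an official recommendation) is to run one of the CNN pipelines alongside the trf pipeline and copy its entities onto the transformer Doc:

import spacy

# Sketch: trf pipeline for tagging/parsing, CNN pipeline (which has "ner") for entities.
nlp_trf = spacy.load("es_dep_news_trf")
nlp_ner = spacy.load("es_core_news_sm")   # any es_core_news_* variant includes "ner"

text = "mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel"
doc = nlp_trf(text)
ner_doc = nlp_ner(text)

# Re-anchor the entities on the transformer Doc via character offsets; char_span
# returns None if an offset doesn't line up with a token boundary, so filter those out.
spans = [doc.char_span(ent.start_char, ent.end_char, label=ent.label_) for ent in ner_doc.ents]
doc.ents = [span for span in spans if span is not None]
print(doc.ents)

This keeps the trf pipeline unchanged; the extra cost is one pass with the much smaller CNN model rather than a second transformer.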

Upvotes: 1
