ibarant

Reputation: 276

How to load data for only certain labels of spaCy's NER entities?

I just started exploring spaCy and need it only for the GPE (geopolitical entity) label of the named entity recognition (NER) component.

So, to save time on loading I keep only 'ner':

    import spacy

    nlp = spacy.load('en_core_web_sm', disable=['tok2vec','tagger','parser', 'senter', 'attribute_ruler', 'lemmatizer'])

Then I create a set of cities / states / countries that exist in the text by running:

    doc = nlp(txt)
    # keep only the text of entities labelled GPE
    geo_ents = {ent.text for ent in doc.ents if ent.label_ == 'GPE'}

So I only need the small subset of entities with label_ == 'GPE'. I haven't found a way to run only that part of the model, which would reduce runtime on large volumes of text.

Could you please show me how to load only certain labels of spaCy's NER component? That might also help others who need only selected entity types.

Thank you very much!

Upvotes: 1

Views: 1125

Answers (1)

polm23

Reputation: 15593

It isn't possible to do this. The NER model classifies each token/span among all the labels it knows about, and that knowledge is not separable by label.

Additionally, the NER component requires a tok2vec. Depending on the pipeline architecture you may be able to disable the top-level tok2vec. (EDIT: I incorrectly stated the top-level tok2vec was required for the small English model; it is not. See here for details.)

It may be possible to train a smaller model that only recognizes GPEs with similar accuracy, but I wouldn't be too optimistic about it. It also wouldn't be faster.
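Since the model can't be restricted to one label, the practical route is to keep filtering after the fact, as in the question. For many texts, feeding them through `nlp.pipe` (spaCy's batched processing API) is usually faster than calling `nlp` on each text separately. Below is a minimal sketch of that, not a definitive implementation: the helper names `collect_entity_texts` and `gpe_names` and the `batch_size` value are made up for illustration, and the `disable` list is simply the one from the question.

```python
from typing import Iterable, Set


def collect_entity_texts(docs: Iterable, labels: Set[str]) -> Set[str]:
    """Return the distinct entity strings whose label is in `labels`.

    `docs` can be any iterable of objects exposing `.ents`, where each
    entity has `.text` and `.label_` (as spaCy Doc objects do).
    """
    return {ent.text for doc in docs for ent in doc.ents if ent.label_ in labels}


def gpe_names(texts: Iterable[str], model: str = "en_core_web_sm") -> Set[str]:
    """Hypothetical wrapper: run the NER pipeline over `texts` in batches."""
    import spacy  # imported here so the helper above stays dependency-free

    # Same components disabled as in the question; NER still predicts
    # every label, we just discard the ones we don't want afterwards.
    nlp = spacy.load(
        model,
        disable=["tok2vec", "tagger", "parser", "senter",
                 "attribute_ruler", "lemmatizer"],
    )
    # batch_size=64 is an arbitrary illustrative value; tune it for your data.
    return collect_entity_texts(nlp.pipe(texts, batch_size=64), {"GPE"})
```

Calling `gpe_names(list_of_texts)` would then give the same kind of set as `geo_ents` in the question, but for the whole collection. Passing a different label set to `collect_entity_texts` selects other entity types.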

Upvotes: 2
