Nobody Home

Reputation: 43

How to optimize spaCy pipe for NER only (using an existing model, no training)

I am looking to use spaCy v3 to extract named entities from a large list of sentences. What I have works, but it seems slower than it should be, and before investing in more machines, I'd like to know whether I am doing more work than I need to in the pipe.

I've used nltk to parse everything into sentences as an iterator, then process these using "pipe" to get the named entities. All of this appears to work well, and Python appears to be hitting every CPU core on my machine fairly heavily, which is good.

import spacy

nlp = spacy.load("en_core_web_trf")
for (doc, context) in nlp.pipe(lines, as_tuples=True, batch_size=1000):
    for ent in doc.ents:
        pass  # handle each entity
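For reference, lines is an iterator of (sentence, context) tuples built roughly like this (a minimal sketch, not my exact code; the placeholder input and the use of the text index as context are just illustrations):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # sentence tokenizer data, only needed once

raw_texts = ["Apple was founded by Steve Jobs. It is based in Cupertino."]  # placeholder input

# (sentence, context) tuples; here the context is just the index of the source text
lines = (
    (sentence, text_id)
    for text_id, text in enumerate(raw_texts)
    for sentence in sent_tokenize(text)
)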

I understand that I can use nlp.disable_pipes to disable certain pipeline components. Is there anything I can disable that won't impact accuracy and that isn't required for NER?

Upvotes: 3

Views: 2188

Answers (1)

aab

Reputation: 11484

For NER only with the transformer model en_core_web_trf, you can disable ["tagger", "parser", "attribute_ruler", "lemmatizer"].
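For example, passing disable to spacy.load keeps only the components NER needs (a minimal sketch; the printed pipe_names may vary slightly by model version):

import spacy

# Load the transformer pipeline with the components NER doesn't need disabled
nlp = spacy.load(
    "en_core_web_trf",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer"],
)
print(nlp.pipe_names)  # expect something like ['transformer', 'ner']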

If you want to use a non-transformer model like en_core_web_lg (much faster but slightly lower accuracy), you can disable ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"] and use nlp.pipe(n_process=-1) for multiprocessing on all CPUs (or n_process=N to restrict to N CPUs).
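A minimal sketch of that variant, reusing the loop from the question (the batch_size of 1000 is carried over from the question, not a tuned value):

import spacy

# CPU pipeline: tok2vec can be disabled as well, since NER doesn't depend on it here
nlp = spacy.load(
    "en_core_web_lg",
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
)

# n_process=-1 spreads batches across all CPU cores
for doc, context in nlp.pipe(lines, as_tuples=True, batch_size=1000, n_process=-1):
    for ent in doc.ents:
        pass  # handle each entity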

Upvotes: 6
