Testik

Reputation: 59

How to predict entities for multiple sentences using spaCy?

I have trained an NER model using spaCy. I know how to use it to recognize the entities for a single sentence (doc object) and visualize the results:

doc = disease_blank('Example sentence')    
spacy.displacy.render(doc, style="ent", jupyter=True)

or

for ent in doc.ents:
    print(ent.text, ent.label_)

Now I want to predict the entities for multiple such sentences. My idea is to filter the sentences by their entities. At the moment I have only found the following way to do it:

sentences = ['sentence 1', 'sentence2', 'sentence3']
for element in sentences:
    doc = nlp(element)
    for ent in doc.ents:
        if ent.label_ == "LOC":
            print(doc)  # prints every sentence which has the entity "LOC"

My question: is there a better and more efficient way to do this?

Upvotes: 1

Views: 742

Answers (1)

David Espinosa

Reputation: 879

You have 2 options to speed up your current implementation:

  • Use the hints provided by the spaCy developers here. Without knowing which specific components your custom NER model pipeline has, a refactored version of your code would look like this (a sketch of checking and disabling unused components follows after this list):
import spacy
import multiprocessing

cpu_cores = multiprocessing.cpu_count()-2 if multiprocessing.cpu_count()-2 > 1 else 1
nlp = spacy.load("./path/to/your/own/model")

sentences = ['sentence 1', 'sentence2', 'sentence3']
for doc in nlp.pipe(sentences, n_process=cpu_cores):  # disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"] ... if your model has them. Check with `nlp.pipe_names`
    # prints every sentence which has the entity "LOC"
    print([doc for ent in doc.ents if ent.label_ == "LOC"])
  • Combine the previous knowledge with the use of spaCy custom components (as carefully explained here). Using this option, your refactored / improved code would look like:
import spacy
import multiprocessing
from spacy.language import Language

cpu_cores = multiprocessing.cpu_count()-2 if multiprocessing.cpu_count()-2 > 1 else 1

@Language.component("loc_label_filter")
def custom_component_function(doc):
    old_ents = doc.ents
    new_ents = [item for item in old_ents if item.label_ == "LOC"]
    doc.ents = new_ents
    return doc


nlp = spacy.load("./path/to/your/own/model")
nlp.add_pipe("loc_label_filter", after="ner")

sentences = ['sentence 1', 'sentence2', 'sentence3']

for doc in nlp.pipe(sentences, n_process=cpu_cores):
    print([doc for ent in doc.ents])  # the custom component has already filtered ents down to "LOC"
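
As mentioned for the first option, you can also check which components your model actually ships and disable the ones the entity filter does not need. A minimal sketch, assuming a model path like the one above; which components are safe to disable depends on your own pipeline (a listener-based "ner" usually still needs "tok2vec"):

import spacy
import multiprocessing

cpu_cores = multiprocessing.cpu_count()-2 if multiprocessing.cpu_count()-2 > 1 else 1
nlp = spacy.load("./path/to/your/own/model")

print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'ner', 'lemmatizer']

# keep only what NER needs; adjust this whitelist to your own pipeline
unneeded = [name for name in nlp.pipe_names if name not in ("tok2vec", "ner")]

sentences = ['sentence 1', 'sentence2', 'sentence3']
for doc in nlp.pipe(sentences, n_process=cpu_cores, disable=unneeded):
    print([doc for ent in doc.ents if ent.label_ == "LOC"])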

IMPORTANT:

  1. Please notice that the speed-up will only be noticeable if your sentences variable contains hundreds or thousands of samples; if sentences is "small" (i.e., it contains only a hundred sentences or fewer), you (and the time benchmarks) may not notice a big difference.
  2. Please also notice that the batch_size parameter in nlp.pipe can also be fine-tuned, but in my own experience you should only do that if, after applying the previous hints, you still don't see a considerable difference (a short batch_size sketch follows below).
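
For reference, a minimal sketch of what tuning batch_size could look like; the value 64 below is only an illustrative starting point, not a recommendation:

for doc in nlp.pipe(sentences, n_process=cpu_cores, batch_size=64):
    print([doc for ent in doc.ents if ent.label_ == "LOC"])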

Upvotes: 1
