kevin.w.johnson

Reputation: 1794

Spacy Entity from PhraseMatcher only

I'm using spaCy for an NLP project. I have a list of phrases I'd like to mark as a new entity type. I originally tried training an NER model, but since there's a finite terminology list, I think simply using a Matcher should be easier. I see in the documentation that you can add entities to a document based on a Matcher. My question is: how do I do this for a new entity type without the NER pipe labeling any other tokens as that entity? Ideally, only tokens found via my matcher would be marked as the entity, but to use the label at all I had to add it to the NER model, which then ends up labeling other tokens as that entity as well.

Any suggestions on how to best accomplish this? Thanks!

Upvotes: 4

Views: 3830

Answers (2)

David Marx

Reputation: 8558

As of spaCy v2.1, spaCy provides an out-of-the-box solution for this: the EntityRuler class.

To match only the entities you care about, you could either disable the entity recognizer pipeline component before adding the custom component, or instantiate an empty language model and add the custom component to that, e.g.

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

# an empty English pipeline: tokenizer only, no statistical NER
nlp = English()

my_patterns = [{"label": "ORG", "pattern": "spacy team"}]
ruler = EntityRuler(nlp)
ruler.add_patterns(my_patterns)
nlp.add_pipe(ruler)  # spaCy v2 API: add the component instance to the pipeline

doc = nlp("The spacy team are amazing!")
assert str(doc.ents[0]) == 'spacy team'

If all you want to do is tokenize the document and exact match entities from your terms list, instantiating the empty language model is probably the simplest solution.
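If you'd rather keep a pretrained model's other components (tagger, parser) and only replace its statistical NER, a rough sketch of that option could look like the following (the model name is just an example and has to be installed separately):

import spacy
from spacy.pipeline import EntityRuler

# load a pretrained model, then drop its statistical entity recognizer
nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("ner")

# add the rule-based matcher in its place (spaCy v2 API)
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "spacy team"}])
nlp.add_pipe(ruler)

doc = nlp("The spacy team are amazing!")
print([(ent.text, ent.label_) for ent in doc.ents])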

Upvotes: 1

Ines Montani

Reputation: 7105

I think you might want to implement something similar to this example – i.e. a custom pipeline component that uses the PhraseMatcher and assigns entities. spaCy's built-in entity recognizer is also just a pipeline component – so you can remove it from the pipeline and add your custom component instead:

nlp = spacy.load('en')               # load some model
nlp.remove_pipe('ner')               # remove the statistical entity recognizer
entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')  # your own entity matcher component
nlp.add_pipe(entity_matcher)         # add it to the pipeline

Your entity matcher component could then look something like this:

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        # create a Doc pattern for each term and register them under the given label
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            # the match ID is the hash of the label string, so it can be
            # used directly as the Span's label
            span = Span(doc, start, end, label=match_id)
            spans.append(span)
        doc.ents = spans
        return doc

When your component is initialised, it creates match patterns for your terms, and adds them to the phrase matcher. My example assumes that you have a list of terms and a label you want to assign for those terms:

entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')
nlp.add_pipe(entity_matcher)

print(nlp.pipe_names)  # see all components in the pipeline

When you call nlp on a string of text, spaCy will tokenize the text to create a Doc object and call the individual pipeline components on the Doc in order. Your custom component's __call__ method then finds matches in the document, creates a Span for each of them (which allows you to assign a custom label), adds them to the doc.ents property and finally returns the Doc.
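Putting it together, usage could look something like this (the terms, label and example sentence are just placeholders, and the snippet assumes the EntityMatcher class defined above):

import spacy

nlp = spacy.load('en')
nlp.remove_pipe('ner')

terms = ['gradient descent', 'backpropagation']  # placeholder terminology list
entity_matcher = EntityMatcher(nlp, terms, 'ML_CONCEPT')
nlp.add_pipe(entity_matcher)

doc = nlp("We trained the network with backpropagation and gradient descent.")
print([(ent.text, ent.label_) for ent in doc.ents])
# something like: [('backpropagation', 'ML_CONCEPT'), ('gradient descent', 'ML_CONCEPT')]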

You can structure your pipeline component however you like – for example, you could extend it to load in your terminology list from a file or make it add multiple rules for different labels to the PhraseMatcher.
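For instance, a variant that takes a dict mapping each label to its own terms list (purely illustrative; the class and argument names here are my own) might look like this:

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class MultiLabelEntityMatcher(object):
    name = 'multi_label_entity_matcher'

    def __init__(self, nlp, terms_by_label):
        # terms_by_label: e.g. {'DRUG': [...], 'DISEASE': [...]}
        self.matcher = PhraseMatcher(nlp.vocab)
        for label, terms in terms_by_label.items():
            patterns = [nlp.make_doc(term) for term in terms]
            self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        spans = [Span(doc, start, end, label=match_id)
                 for match_id, start, end in self.matcher(doc)]
        doc.ents = spans
        return doc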

Upvotes: 12
