Reputation: 1794
I'm using spaCy for an NLP project. I have a list of phrases I'd like to mark as a new entity type. I originally tried training a NER model, but since there's a finite terminology list, I think simply using a Matcher should be easier. I see in the documentation that you can add entities to a document based on a Matcher. My question is: how do I do this for a new entity type without the NER pipe labeling any other tokens as this entity? Ideally only tokens found via my matcher should be marked as the entity, but I need to add it as a label to the NER model, which then ends up labeling other tokens as the entity as well.
Any suggestions on how to best accomplish this? Thanks!
Upvotes: 4
Views: 3830
Reputation: 8558
As of spaCy v2.1, spaCy provides an out-of-the-box solution for this: the EntityRuler class.
To match only the entities you care about, you could either remove the 'ner' pipeline component before adding the EntityRuler, or instantiate an empty language model and add the EntityRuler to that, e.g.
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
my_patterns = [{"label": "ORG", "pattern": "spacy team"}]
ruler = EntityRuler(nlp)
ruler.add_patterns(my_patterns)
nlp.add_pipe(ruler)
doc = nlp("The spacy team are amazing!")
assert str(doc.ents[0]) == 'spacy team'
If all you want to do is tokenize the document and exact-match entities from your terms list, instantiating the empty language model is probably the simplest solution.
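If you still want the rest of a pretrained pipeline (tagger, parser, etc.) but not its statistical entity predictions, the other option could look roughly like this minimal sketch (assuming the en_core_web_sm model is installed; any pretrained English model works):
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
nlp.remove_pipe('ner')  # drop the statistical entity recognizer

ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "spacy team"}])
nlp.add_pipe(ruler)

doc = nlp("The spacy team are amazing!")
assert str(doc.ents[0]) == 'spacy team'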
Upvotes: 1
Reputation: 7105
I think you might want to implement something similar to this example, i.e. a custom pipeline component that uses the PhraseMatcher and assigns entities. spaCy's built-in entity recognizer is also just a pipeline component, so you can remove it from the pipeline and add your custom component instead:
import spacy

nlp = spacy.load('en')          # load some model
nlp.remove_pipe('ner')          # remove the entity recognizer
entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')  # your own matcher component (defined below)
nlp.add_pipe(entity_matcher)    # add it to the pipeline
Your entity matcher component could then look something like this:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        # create a Doc pattern for each term and register them under the given label
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        # find all matches in the Doc and turn them into labelled entity spans
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc
When your component is initialised, it creates match patterns for your terms and adds them to the phrase matcher. My example assumes that you have a list of terms and a label you want to assign to those terms:
entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names) # see all components in the pipeline
When you call nlp on a string of text, spaCy will tokenize the text to create a Doc object and call the individual pipeline components on the Doc in order. Your custom component's __call__ method then finds matches in the document, creates a Span for each of them (which allows you to assign a custom label) and finally adds them to the doc.ents property and returns the Doc.
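For a quick end-to-end check (the terms and label below are made up for illustration, not part of the example above):
# assuming the pipeline was set up as shown, with e.g.
# your_list_of_terms = ['machine learning'] and the label 'TECH'
doc = nlp("I love machine learning and natural language processing.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('machine learning', 'TECH')]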
You can structure your pipeline component however you like – for example, you could extend it to load in your terminology list from a file, or make it add multiple rules for different labels to the PhraseMatcher.
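A hypothetical variant along those lines (the file name and JSON layout are my own assumptions here) could read a {label: [terms]} mapping from disk and register one rule set per label:
import json

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class FileEntityMatcher(object):
    name = 'file_entity_matcher'

    def __init__(self, nlp, path):
        # e.g. terms.json: {"TECH": ["machine learning"], "ORG": ["spacy team"]}
        with open(path) as f:
            terms_by_label = json.load(f)
        self.matcher = PhraseMatcher(nlp.vocab)
        for label, terms in terms_by_label.items():
            patterns = [nlp.make_doc(term) for term in terms]
            self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        # match_id is the hash of the label the pattern was registered under
        doc.ents = [Span(doc, start, end, label=match_id)
                    for match_id, start, end in self.matcher(doc)]
        return doc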
Upvotes: 12