Reputation: 1794
I'm using spaCy for an NLP project. I have a list of phrases I'd like to mark as a new entity type. I originally tried training a NER model, but since there's a finite terminology list, I think simply using a Matcher should be easier. I see in the documentation that you can add entities to a document based on a Matcher. My question is: how do I do this for a new entity type without the NER pipe labeling any other tokens as this entity? Ideally only tokens found via my matcher should be marked as the entity, but I need to add it as a label to the NER model, which then ends up labeling other tokens as the entity as well.
Any suggestions on how to best accomplish this? Thanks!
Upvotes: 4
Views: 3830
Reputation: 8558
As of spaCy v2.1, spaCy provides an out-of-the-box solution for this: the EntityRuler class.
To match only the entities you care about, you could either remove the 'ner' pipeline component before adding the EntityRuler, or instantiate an empty language model and add the EntityRuler to that, e.g.
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
nlp = English()
my_patterns = [{"label": "ORG", "pattern": "spacy team"}]
ruler = EntityRuler(nlp)
ruler.add_patterns(my_patterns)
nlp.add_pipe(ruler)
doc = nlp("The spacy team are amazing!")
assert str(doc.ents[0]) == 'spacy team'
If all you want to do is tokenize the document and exact-match entities from your terms list, instantiating the empty language model is probably the simplest solution.
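If you still want the rest of a pretrained pipeline (tagger, parser, etc.) but not its statistical entity predictions, the other option could look roughly like this minimal sketch (assuming the en_core_web_sm model is installed; any pretrained English model works):
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
nlp.remove_pipe('ner')  # drop the statistical entity recognizer

ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "spacy team"}])
nlp.add_pipe(ruler)

doc = nlp("The spacy team are amazing!")
assert str(doc.ents[0]) == 'spacy team'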
Upvotes: 1
Reputation: 7105
I think you might want to implement something similar to this example, i.e. a custom pipeline component that uses the PhraseMatcher and assigns entities. spaCy's built-in entity recognizer is also just a pipeline component, so you can remove it from the pipeline and add your custom component instead:
import spacy

nlp = spacy.load('en')          # load some model
nlp.remove_pipe('ner')          # remove the entity recognizer
entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')  # your own matcher component (defined below)
nlp.add_pipe(entity_matcher)    # add it to the pipeline
Your entity matcher component could then look something like this:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        # create a Doc pattern for each term and register them under the given label
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        # find all matches in the Doc and turn them into labelled entity spans
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc
When your component is initialised, it creates match patterns for your terms and adds them to the phrase matcher. My example assumes that you have a list of terms and a label you want to assign to those terms:
entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names) # see all components in the pipeline
When you call nlp on a string of text, spaCy will tokenize the text to create a Doc object and call the individual pipeline components on the Doc in order. Your custom component's __call__ method then finds matches in the document, creates a Span for each of them (which allows you to assign a custom label) and finally adds them to the doc.ents property and returns the Doc.
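For a quick end-to-end check (the terms and label below are made up for illustration, not part of the example above):
# assuming the pipeline was set up as shown, with e.g.
# your_list_of_terms = ['machine learning'] and the label 'TECH'
doc = nlp("I love machine learning and natural language processing.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('machine learning', 'TECH')]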
You can structure your pipeline component however you like – for example, you could extend it to load in your terminology list from a file, or make it add multiple rules for different labels to the PhraseMatcher.
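A hypothetical variant along those lines (the file name and JSON layout are my own assumptions here) could read a {label: [terms]} mapping from disk and register one rule set per label:
import json

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class FileEntityMatcher(object):
    name = 'file_entity_matcher'

    def __init__(self, nlp, path):
        # e.g. terms.json: {"TECH": ["machine learning"], "ORG": ["spacy team"]}
        with open(path) as f:
            terms_by_label = json.load(f)
        self.matcher = PhraseMatcher(nlp.vocab)
        for label, terms in terms_by_label.items():
            patterns = [nlp.make_doc(term) for term in terms]
            self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        # match_id is the hash of the label the pattern was registered under
        doc.ents = [Span(doc, start, end, label=match_id)
                    for match_id, start, end in self.matcher(doc)]
        return doc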
Upvotes: 12