Reputation: 9207
I would like to label every token that has not been matched by an earlier pattern as "Unknown". Unfortunately, the entity ruler does not seem to respect the order in which the patterns were provided:
import spacy
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [
{'label': 'Country', 'pattern': [{'lower': 'ger'}]},
{'label': 'Unknown', 'pattern': [{'OP': '?'}]}
]
ruler.add_patterns(patterns)
doc = nlp('ger is a country')
print([(ent.text, ent.label_) for ent in doc.ents])
Expected:
[('ger', 'Country'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]
Actual:
[('ger', 'Unknown'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]
How can I ensure the patterns are matched in order?
Upvotes: 2
Views: 828
Reputation: 9207
Based on polm23's answer, here is a working example:
import spacy
nlp = spacy.blank("en")
# First ruler: the specific patterns that should take priority
ruler_standard = nlp.add_pipe("entity_ruler", name='ruler_standard', config={'overwrite_ents': True})
patterns = [{'label': 'Country', 'pattern': [{'lower': 'ger'}]}, ]
ruler_standard.add_patterns(patterns)
# Second ruler: the catch-all pattern; overwrite_ents=False leaves entities set by the first ruler untouched
ruler_unknown = nlp.add_pipe("entity_ruler", name='ruler_unknown', after='ruler_standard', config={'overwrite_ents': False})
patterns = [{'label': 'Unknown', 'pattern': [{"OP": "?"}]}, ]
ruler_unknown.add_patterns(patterns)
doc = nlp('ger is a country')
print([(ent.text, ent.label_) for ent in doc.ents])
# [('ger', 'Country'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]
Upvotes: 1
Reputation: 15623
There are a couple of ways to do this. A simple one is to use two EntityRulers. By default (overwrite_ents=False), the second won't overwrite anything set by the first.
You could also use the relatively new SpanRuler with a custom filtering function that always prefers any other label over "Unknown" when matches overlap.
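A rough sketch of the SpanRuler route, assuming a spaCy 3.x release that ships the `span_ruler` component. The names `prefer_known_spans`, `build_pipeline` and the registry key `prefer_known_ents.v1` are made up for this example. The filter ranks the fallback label last, so a specific label wins whenever two matches cover the same tokens:

```python
def prefer_known_spans(existing, candidates, fallback_label="Unknown"):
    """Keep existing spans, then add non-overlapping candidates,
    ranking spans with the fallback label last."""
    kept = list(existing)
    taken = {i for span in kept for i in range(span.start, span.end)}
    # Specific labels first, then longer spans, then earlier spans.
    ranked = sorted(
        candidates,
        key=lambda s: (s.label_ == fallback_label, -(s.end - s.start), s.start),
    )
    for span in ranked:
        tokens = range(span.start, span.end)
        if taken.isdisjoint(tokens):
            kept.append(span)
            taken.update(tokens)
    return sorted(kept, key=lambda s: s.start)


def build_pipeline():
    # spaCy is imported here so the filter above can be used and tested alone.
    import spacy

    @spacy.registry.misc("prefer_known_ents.v1")  # hypothetical registry name
    def make_filter():
        return prefer_known_spans

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe(
        "span_ruler",
        config={
            "spans_key": None,      # do not write to doc.spans
            "annotate_ents": True,  # write the filtered matches to doc.ents
            "ents_filter": {"@misc": "prefer_known_ents.v1"},
        },
    )
    ruler.add_patterns([
        {"label": "Country", "pattern": [{"lower": "ger"}]},
        {"label": "Unknown", "pattern": [{"OP": "?"}]},
    ])
    return nlp
```

To try it, call `nlp = build_pipeline()`, run `doc = nlp('ger is a country')` and print `doc.ents` as in the question. Compared with two EntityRulers, this keeps all patterns in one component, at the cost of maintaining the filter yourself.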
Upvotes: 1