Andreas

Reputation: 9207

spaCy entity ruler - how to order patterns

I would like to label every token that has not been labeled by an earlier pattern as "Unknown". Unfortunately, the entity ruler does not seem to respect the order in which the patterns are provided:

import spacy
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {'label': 'Country', 'pattern': [{'lower': 'ger'}]},
    {'label': 'Unknown', 'pattern': [{'OP': '?'}]}
]
ruler.add_patterns(patterns)
doc = nlp('ger is a country')
print([(ent.text, ent.label_) for ent in doc.ents])

Expected:

[('ger', 'Country'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]

Actual:

[('ger', 'Unknown'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]

How can I ensure the patterns are matched in order?

Upvotes: 2

Views: 828

Answers (2)

Andreas

Reputation: 9207

Based on polm23's answer, here is a working example:

import spacy

nlp = spacy.blank("en")

# Standard entity ruler: runs first and assigns the specific labels
ruler_standard = nlp.add_pipe("entity_ruler", name='ruler_standard', config={'overwrite_ents': True})
patterns = [{'label': 'Country', 'pattern': [{'lower': 'ger'}]}]
ruler_standard.add_patterns(patterns)

# Unknown entity ruler: runs after the standard one; with overwrite_ents=False
# it only labels tokens that are not already part of an entity
ruler_unknown = nlp.add_pipe("entity_ruler", name='ruler_unknown', after='ruler_standard', config={'overwrite_ents': False})
patterns = [{'label': 'Unknown', 'pattern': [{'OP': '?'}]}]
ruler_unknown.add_patterns(patterns)


doc = nlp('ger is a country')
print([(ent.text, ent.label_) for ent in doc.ents])
# [('ger', 'Country'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]

Upvotes: 1

polm23

Reputation: 15623

There are a couple of ways to do this. A simple one is to use two EntityRulers. By default, the second won't overwrite anything set by the first.

You could also use the relatively new SpanRuler with a custom filtering function that always prefers any other label over "Unknown" when matches overlap.
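To illustrate the filtering idea, here is a minimal sketch in plain Python, without spaCy. It assumes matches are represented as (start, end, label) tuples over token indices; the function name prefer_known_labels and the tuple layout are made up for this example and are not part of spaCy's API. With SpanRuler you would have to adapt the same logic to Span objects and plug it in through its filter settings (such as ents_filter), which is not shown here.

```python
from typing import List, Tuple

# Hypothetical match representation: (start token, end token exclusive, label)
Match = Tuple[int, int, str]


def prefer_known_labels(matches: List[Match], fallback: str = "Unknown") -> List[Match]:
    """Resolve overlapping matches so that any label beats the fallback label."""
    # Specific labels come first, then longer spans, then earlier starts.
    ordered = sorted(
        matches,
        key=lambda m: (m[2] == fallback, -(m[1] - m[0]), m[0]),
    )
    taken = set()  # token indices already covered by a kept match
    kept = []
    for start, end, label in ordered:
        covered = range(start, end)
        if any(i in taken for i in covered):
            continue  # overlaps a higher-priority match, so drop it
        taken.update(covered)
        kept.append((start, end, label))
    return sorted(kept)  # back to document order


# Candidate matches for the tokens of 'ger is a country'
tokens = ["ger", "is", "a", "country"]
candidates = [
    (0, 1, "Unknown"),
    (0, 1, "Country"),
    (1, 2, "Unknown"),
    (2, 3, "Unknown"),
    (3, 4, "Unknown"),
]
resolved = prefer_known_labels(candidates)
print([(" ".join(tokens[s:e]), label) for s, e, label in resolved])
# [('ger', 'Country'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]
```

The key point is the sort key: the fallback label is ranked last, so it only survives on tokens that no other pattern claimed, regardless of the order in which the patterns were added.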

Upvotes: 1
