What is the best way to overcome wrong entity recognision with Spacy?

Question

I'm testing such sentence to extract entity values: s = "Height: 3m, width: 4.0m, others: 3.4 m, 4m, 5 meters, 10 m. Quantity: 6."

sent = nlp(s)

for ent in sent.ents:
    print(ent.text, ent.label_)

And got some misleading values:

3 CARDINAL 4.0m CARDINAL 3.4 m CARDINAL 4m CARDINAL 5 meters QUANTITY 10 m QUANTITY 6 CARDINAL

namely, number 3m is not paired with m. This is the case for many examples as I can't rely on this engine when want to separate meters from quantities.

Should I do this manually?

Ines Montani · Accepted Answer

One potential difficulty in your example is that it's not very close to natural language. The pre-trained English models were trained on ~2m words of general web and news text, so they're not always going to perform perfect out-of-the-box on text with a very different structure.

While you could update the model with more examples of QUANTITY in your specific texts, I think that a rule-based approach might actually be a better and more efficient solution here.

The example in this blog post is actually very close to what you're trying to do:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
weights_pattern = [
    {"LIKE_NUM": True},
    {"LOWER": {"IN": ["g", "kg", "grams", "kilograms", "lb", "lbs", "pounds"]}}
]
patterns = [{"label": "QUANTITY", "pattern": weights_pattern}]
ruler = EntityRuler(nlp, patterns=patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp("U.S. average was 2 lbs.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('U.S.', 'GPE'), ('2 lbs', 'QUANTITY')]

The statistical named entity recognizer respects pre-defined entities and wil "predict around" them. So if you're adding the EntityRuler before it in the pipeline, your custom QUANTITY entities will be assigned first and will be taken into account when the entity recognizer predicts labels for the remaining tokens.

Note that this example is using the latest version of spaCy, v2.1.x. You might also want to add more patterns to cover different constructions. For more details and inspiration, check out the documentation on the EntityRuler, combining models and rules and the token match pattern syntax.

What is the best way to overcome wrong entity recognision with Spacy?

Answers (1)

Related Questions