dlesser

Reputation: 41

How to combine entity terms using Spacy EntityRuler NLP?

I am working on using spaCy for some NLP tasks, such as calculating entity frequency and PMI scores (relationship ranking between organization entities and lemmas). My corpus often contains specific organizations in various permutations (e.g. Harman, HARMAN, Harman International...) that I want to always be recognized as one entity. That way, when counting frequencies, they are all treated as a single organization entity rather than as separate, unique entities.

I believe spacy.pipeline.EntityRuler is the right way to add these rules to the spaCy pipeline, but I am not getting the desired outcome. After running the code below, the entity list does not appear to be updated; the various permutations of the organization are still returned as unique entities.

I am not sure what I am doing wrong at this point, so any help is appreciated!

Thank you.

Code:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm', disable = ['parser','tagger'])
ruler = EntityRuler(nlp, overwrite_ents = True) #replace entities that may exist with the following
patterns = [{"label": "ORG", "pattern": [{"TEXT":"HARMAN"}, {"TEXT":"International"}], "id": "harman"},
           {"label": "ORG", "pattern": [{"TEXT":"HARMAN"}], "id": "harman"},
           {"label": "ORG", "pattern": [{"TEXT":"Harman"}], "id": "harman"},
           {"label": "ORG", "pattern": [{"TEXT":"Harman"}, {"TEXT":"International"}], "id": "harman"}
           ] 
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ner")
corpus_nlp = [nlp(corpus['Body'][i]) for i in corpus.index]

corpus_nlp[49].ents

(Harman, Zinnov, HARMAN, the "Leadership Zone)

Upvotes: 3

Views: 932

Answers (1)

iron9

Reputation: 525

What you're trying to do is map parts of the text to real-world entities, which in NLP is called entity linking.

You can extend the usual spaCy pipeline (e.g. 'en_core_web_lg') with an entity ruler that uses the patterns you specified, to make sure all occurrences of the text are recognized as entities in the first place. You can then write an entity linker (much simpler than the one provided by spaCy) that checks a knowledge base to determine which real-world entity each text entity should be mapped to.

You can create both the patterns (for the entity ruler) and the knowledge base (for the entity linker) from the same dictionary (ENTITIES_MAP), which makes it easy to extend.
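For intuition, here is a minimal, standalone sketch (separate from the full pipeline below) of the knowledge-base lookup the linker relies on; the entity ID and alias are just examples:

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_lg")

# The knowledge base maps aliases (surface forms) to real-world entity IDs
kb = KnowledgeBase(nlp.vocab, entity_vector_length=300)
kb.add_entity("harman", freq=1, entity_vector=nlp.vocab["harman"].vector)
kb.add_alias("HARMAN International", entities=["harman"], probabilities=[1.0])

# Looking up an alias returns candidate entities; here there is exactly one
candidates = kb.get_alias_candidates("HARMAN International")
print([c.entity_ for c in candidates])  # ['harman']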

Using spaCy 3, the full pipeline might look something like this. (Note that the linker could be simplified; I followed the general structure of the official component, which makes it a little more complicated.)

from typing import Callable, Optional, List, Iterable

import spacy
from spacy import Language, Vocab
from spacy.kb import KnowledgeBase
from spacy.tokens import Doc


# Map from entity ID to the ways it could appear in a text
# This can be edited in order to change the patterns and kb
ENTITIES_MAP = {
    "harman": [
        "harman",
        "Harman",
        "HARMAN",
        "HARMAN International",
        "Harman International",
    ],
    "other_company": [
        "Other Company",
        "OC"
    ]
}


class CustomLinker:
    def __init__(self, vocab: Vocab) -> None:
        self._vocab = vocab
        self._kb: Optional[KnowledgeBase] = None

    def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]) -> None:
        self._kb = kb_loader(self._vocab)

    def __call__(self, doc: Doc) -> Doc:
        kb_ids = self.predict([doc])
        self.set_annotations([doc], kb_ids)
        return doc

    def predict(self, docs: Iterable[Doc]) -> List[str]:
        assert self._kb is not None, "You forgot to call 'set_kb()'"
        assert len(self._kb) > 0, "kb is empty"
        kb_ids = []
        for doc in docs:
            for ent in doc.ents:
                candidates = self._kb.get_alias_candidates(ent.text)
                if not candidates:
                    kb_ids.append("NIL")
                elif len(candidates) == 1:
                    kb_ids.append(candidates[0].entity_)
                else:
                    assert False, "The kb was set up ambiguously"
        return kb_ids

    def set_annotations(self, docs: Iterable[Doc], kb_ids: List[str]) -> None:
        count_ents = len([ent for doc in docs for ent in doc.ents])
        assert count_ents == len(
            kb_ids
        ), f"Number of entities is {count_ents}, but number of kb_ids is {len(kb_ids)}"
        i = 0
        for doc in docs:
            for ent in doc.ents:
                kb_id = kb_ids[i]
                i += 1
                for token in ent:
                    token.ent_kb_id_ = kb_id


@Language.factory("custom_linker")
def make_entity_linker(nlp: Language, name: str) -> CustomLinker:
    return CustomLinker(nlp.vocab)


def create_pipeline() -> Language:
    nlp = spacy.load("en_core_web_lg")

    # Entity ruler to make sure all occurrences of harman are recognized as entities in the first place
    patterns = [
        {"label": "ORG", "pattern": [{"TEXT": word} for word in alias.split(" ")]}
        for alias in [alias for aliases in ENTITIES_MAP.values() for alias in aliases]
    ]
    ruler = nlp.add_pipe("entity_ruler", last=True)
    ruler.add_patterns(patterns)

    # Entity linker to link all harman entities to the same real-world entity
    def create_kb(vocab):
        kb = KnowledgeBase(vocab, entity_vector_length=300)
        for name, aliases in ENTITIES_MAP.items():
            kb.add_entity(name, freq=1, entity_vector=vocab[name].vector)
            for alias in aliases:
                kb.add_alias(alias, entities=[name], probabilities=[1.0])
        return kb

    linker = nlp.add_pipe("custom_linker")
    linker.set_kb(create_kb)
    
    return nlp


def main():
    nlp = create_pipeline()

    text = (
        "The company HARMAN International is doing NLP."
        " Some other names for it are harman, Harman, Harman International and HARMAN, but not HI."
        " Other Company is a different one. Some also call it by the abbreviation OC."
    )
    doc = nlp(text)
    
    print([(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])
    distinct_real_world_entities = set([ent.kb_id_ for ent in doc.ents]) - {"NIL"}
    for rwe in distinct_real_world_entities:
        count = len([ent for ent in doc.ents if ent.kb_id_ == rwe])
        print(f"'{rwe}' occurs {count} times")


if __name__ == "__main__":
    main()

It produces the following output:

[('HARMAN International', 'ORG', 'harman'), ('NLP', 'ORG', 'NIL'), ('harman', 'ORG', 'harman'), ('Harman', 'ORG', 'harman'), ('Harman International', 'ORG', 'harman'), ('HARMAN', 'ORG', 'harman'), ('Other Company', 'ORG', 'other_company'), ('OC', 'ORG', 'other_company')]
'harman' occurs 5 times
'other_company' occurs 2 times
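If you want corpus-level frequencies as in the question, you can run the pipeline over all documents and aggregate by kb_id_. A rough sketch, assuming corpus is the pandas DataFrame from the question and nlp is the pipeline created above:

from collections import Counter

texts = corpus['Body'].tolist()
entity_counts = Counter(
    ent.kb_id_
    for doc in nlp.pipe(texts)   # process the whole corpus efficiently
    for ent in doc.ents
    if ent.kb_id_ != "NIL"       # skip entities that were not linked
)
print(entity_counts.most_common())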

Note that this will only work well as long as there are no ambiguities. If, for example, your text contains both 'apple' (the fruit) and 'Apple' (the company), you will probably be better off using the spaCy entity linker and creating training data for it. The process is explained here (although for an older version of spaCy).
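To see why, here is a sketch (with made-up entity IDs) of what an ambiguous alias looks like in the knowledge base; get_alias_candidates() then returns several candidates, and the simple linker above has no way to choose between them:

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_lg")
kb = KnowledgeBase(nlp.vocab, entity_vector_length=300)

# Two made-up real-world entities that share a surface form
kb.add_entity("apple_inc", freq=1, entity_vector=nlp.vocab["Apple"].vector)
kb.add_entity("apple_fruit", freq=1, entity_vector=nlp.vocab["apple"].vector)
kb.add_alias("apple", entities=["apple_inc", "apple_fruit"], probabilities=[0.6, 0.4])

print([c.entity_ for c in kb.get_alias_candidates("apple")])
# ['apple_inc', 'apple_fruit'] -> ambiguous, a trained linker is needed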

Upvotes: 2
