kan

Reputation: 31

Finding semantically related named entities in text

I have a set of text documents with tagged named entities like "person", "organization", "location", "product", "amount", "price", etc. I have already fine-tuned a BERT model to recognize these named entities. But I also need to solve the problem of finding related named entities in the text. For example, let's say we have a part of text like this:

Hey, Jack! There is work for you. Thomas Smith of the Big Corporation called this morning and ordered four pizzas for fifteen dollars, and Andy on 28th Street ordered sushi.

BERT will find the following named entities and their positions in this text:

I need a model that can split these entities into groups, which contain semantically related entities as follows:

Is it possible to solve such a problem if I have a training dataset with links between entities? Is there any neural network architecture that can be used on top of the BERT model embeddings for solving this problem? Maybe a graph model?

Upvotes: 2

Views: 808

Answers (1)

David Dale

Reputation: 11444

In your example, all related entities are in the same sentence (but not all entities in the same sentence are related).

If this is the case, then I recommend splitting each sentence into components, and labelling the entities that belong to the same component as related.

To construct the components, you can build a syntax dependency tree of your sentence, and then cut the tree by removing some dependency edges. For example, you can split a sentence into sub-sentences if they have different subjects.

I use spacy to both find entities and build the syntax tree (but spacy does not recognize product names as entities, so you should use your own NER model). Also, you may want to invent your own rules for splitting sentences into parts.

from collections import defaultdict
import spacy
nlp = spacy.load("en_core_web_sm")

text = "Hey, Jack! There is work for you. Thomas Smith of the Big Corporation called this morning and ordered four pizzas for fifteen dollars, and Andy on 28th Street ordered sushi."
doc = nlp(text)

def find_cluster(token):
    # this token is the head of a sentence
    if token.dep_ == 'ROOT' or token.head == token:
        return token.idx
    # this token is the head of an autonomous sub-sentence
    # (a conjunct clause with its own subject)
    if token.dep_ == 'conj' and any(child.dep_ == 'nsubj' for child in token.children):
        return token.idx
    return find_cluster(token.head)

clusters = defaultdict(list)
for e in doc.ents:
    clusters[find_cluster(e[0])].append(e)

for c in clusters.values():
    print(c)

The expected output is:

# [Jack]
# [Thomas Smith, the Big Corporation, this morning, four, fifteen dollars]
# [Andy, 28th Street]
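To see how the `find_cluster` recursion walks the tree without downloading a spaCy model, here is a minimal sketch using a hypothetical stand-in token class (the `Token` class, the `attach` helper, and the hand-built tree are invented for illustration; only `dep_`, `head`, `idx`, and `children` mirror spaCy's token attributes):

    from collections import defaultdict

    class Token:
        """Minimal stand-in for a spaCy token: dep label, head link, char offset."""
        def __init__(self, text, dep_, idx):
            self.text, self.dep_, self.idx = text, dep_, idx
            self.head = self          # a root points to itself, like spaCy's ROOT
            self.children = []

    def attach(child, head):
        child.head = head
        head.children.append(child)

    def find_cluster(token):
        # same logic as the answer's function
        if token.dep_ == 'ROOT' or token.head == token:
            return token.idx
        if token.dep_ == 'conj' and any(c.dep_ == 'nsubj' for c in token.children):
            return token.idx
        return find_cluster(token.head)

    # Hand-built parse of "Thomas ordered pizza, and Andy ordered sushi."
    ordered1 = Token("ordered", "ROOT", 7)
    thomas   = Token("Thomas", "nsubj", 0)
    pizza    = Token("pizza", "dobj", 15)
    ordered2 = Token("ordered", "conj", 31)   # second clause, has its own subject
    andy     = Token("Andy", "nsubj", 26)
    sushi    = Token("sushi", "dobj", 39)
    attach(thomas, ordered1); attach(pizza, ordered1); attach(ordered2, ordered1)
    attach(andy, ordered2); attach(sushi, ordered2)

    clusters = defaultdict(list)
    for ent in (thomas, pizza, andy, sushi):
        clusters[find_cluster(ent)].append(ent.text)
    print(dict(clusters))
    # {7: ['Thomas', 'pizza'], 31: ['Andy', 'sushi']}

"Thomas" and "pizza" climb up to the ROOT verb at offset 7, while "Andy" and "sushi" stop at the conjunct verb at offset 31 because it carries its own `nsubj` child, so the two clauses form separate clusters.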

Upvotes: 1
