user12541823

Reputation:

Create spaCy KnowledgeBase for similar nouns

The entity linking examples in spaCy's documentation are all based on named entities. Is it possible to create a knowledge base such that it links certain nouns with other nouns?

For example, linking "aeroplane" with "plane", or with "aeroplane" in case of a typing error? That way I could pre-define the possible alternative terms for "aeroplane". Are there any concrete examples?

I tried this with Knowledgebase:

from spacy.kb import KnowledgeBase

vocab = nlp.vocab  # nlp is a loaded spaCy pipeline
kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
kb.add_entity(entity="Aeroplane", freq=32, entity_vector=vector1)  # vector1: the pre-trained entity vector I don't know how to obtain

as described here: https://spacy.io/api/kb

but I don't know what to use as the entity_vector, which is supposed to be a pre-trained vector of the entity.

Another example that I saw in the docs was this:

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# adding entities
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])

# adding aliases
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

Can't we use anything other than wiki IDs? And how do I get these vector lengths?

Upvotes: 2

Views: 1440

Answers (1)

Sofie VL

Reputation: 3106

Let me try and address your questions:

The entity linking examples in spaCy's documentation are all based on named entities. Is it possible to create a knowledge base such that it links certain nouns with other nouns?

You can probably use the EL algorithm to link non-named entities with some tweaking. In theory, the underlying ML model really looks at sentence similarity and doesn't much depend on whether or not the words/phrases are named entities.

spaCy's internals currently do assume that you're running the EL algorithm on NER results though. That means that it will only try to link Span objects stored in doc.ents. As a workaround, you could make sure that the words you're trying to link are registered as named entities in doc.ents. You can train a custom NER algorithm that recognizes your specific terms, or run a rule-based matching strategy and set doc.ents with the results of that.
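As a minimal sketch of the rule-based route, you could use spaCy's EntityRuler to push your terms into doc.ents (this uses spaCy v3's string-based add_pipe API; the "VEHICLE" label and the term list are just illustrative choices, not something prescribed by spaCy):

```python
import spacy

# Register the terms we want to link as entities in doc.ents,
# so the entity linker will later consider them.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "VEHICLE", "pattern": "aeroplane"},
    {"label": "VEHICLE", "pattern": "airplane"},
    {"label": "VEHICLE", "pattern": "plane"},
])

doc = nlp("The plane landed safely.")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('plane', 'VEHICLE')]
```

In a real pipeline you'd add the ruler to the same nlp object that carries your entity linker, so the matched spans are available downstream.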

Can't we use anything other than wiki IDs?

Sure - you can use whatever you like, as long as the IDs are unique strings. Let's say that you represent the concept "airplane" with the unique string "AIRPLANE".

but I don't know what to use as the entity_vector, which is supposed to be a pre-trained vector of the entity.

The entity vector is the embedded representation of your concept, and this will be compared to the embedding of the sentence in which the alias occurs, to determine whether or not they match semantically.

There is some more documentation here: https://spacy.io/usage/training#kb

It's easiest if you make sure you have a model with pretrained vectors, typically the _md and _lg models.

Then, you need some kind of description about the entities in your database. For Wikidata, we used the description of an entity, eg "powered fixed-wing aircraft" from https://www.wikidata.org/wiki/Q197. You could also take the first sentence of the Wikipedia article, or anything else you want. As long as it provides some context about your concept.

Let me try to clarify all of the above with some example code:

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # any model with pretrained vectors
vectors_dim = nlp.vocab.vectors.shape[1]
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=vectors_dim)

airplane_description = "An airplane or aeroplane (informally plane) is a powered, fixed-wing aircraft that is propelled forward by thrust from a jet engine, propeller or rocket engine."
airplane_vector = nlp(airplane_description).vector

plane_description = "In mathematics, a plane is a flat, two-dimensional surface that extends infinitely far."
plane_vector = nlp(plane_description).vector

# TODO: Deduce meaningful "freq" values from a corpus: see how often the concept "PLANE" occurs and how often the concept "AIRPLANE" occurs
kb.add_entity(entity="AIRPLANE", freq=666, entity_vector=airplane_vector)
kb.add_entity(entity="PLANE", freq=333, entity_vector=plane_vector)

# TODO: Deduce the prior probabilities from a corpus. Here we assume that the word "plane" most often refers to AIRPLANE (70% of the cases), and infrequently to PLANE (20% of cases)
kb.add_alias(alias="airplane", entities=["AIRPLANE"], probabilities=[0.99])
kb.add_alias(alias="aeroplane", entities=["AIRPLANE"], probabilities=[0.97])
kb.add_alias(alias="plane", entities=["AIRPLANE", "PLANE"], probabilities=[0.7, 0.2])

So in theory, if you have a word "plane" in a mathematical context, the algorithm should learn that this matches the (embedded) description of the PLANE concept better than the AIRPLANE concept.
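Not from the original answer, but to round this out: once the KB is built you can look up the candidate entities for an alias. The sketch below uses dummy 3-dimensional vectors so it runs without pretrained embeddings, and it hedges on class/method names, which moved between spaCy versions (KnowledgeBase/get_candidates in v2, get_alias_candidates in v3, InMemoryLookupKB from v3.5 on):

```python
import spacy

# The concrete KB class moved in spaCy >= 3.5; fall back for older versions.
try:
    from spacy.kb import InMemoryLookupKB as KB
except ImportError:
    from spacy.kb import KnowledgeBase as KB

nlp = spacy.blank("en")
kb = KB(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="AIRPLANE", freq=666, entity_vector=[1.0, 0.0, 0.0])
kb.add_entity(entity="PLANE", freq=333, entity_vector=[0.0, 1.0, 0.0])
kb.add_alias(alias="plane", entities=["AIRPLANE", "PLANE"], probabilities=[0.7, 0.2])

# The lookup method is get_alias_candidates in v3.x, get_candidates in v2.x.
lookup = getattr(kb, "get_alias_candidates", None) or kb.get_candidates
for cand in lookup("plane"):
    print(cand.entity_, cand.prior_prob)
```

Each Candidate carries the entity ID and the prior probability you registered, which the trained linker then combines with the context-vs-entity-vector similarity.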

Hope that helps - I'm happy to discuss further in the comments!

Upvotes: 5
