formicaman

Reputation: 1357

Entity Linking with spacy/Wikipedia

I am trying to follow the example here: https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking. But I am confused about what is in the training data. Is it everything from Wikipedia? Say I only need training data on a few entities, for example E1, E2, and E3. Does the example allow me to specify only the few entities that I want to disambiguate?

Upvotes: 2

Views: 3453

Answers (1)

Sofie VL

Reputation: 3106

[UPDATE] Note that this code base was moved to https://github.com/explosion/projects/tree/master/nel-wikipedia (spaCy v2)

If you run the scripts as provided in https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking, they will indeed create a training dataset from Wikipedia that you can use to train a generic model.

If you're looking to train a more limited model, of course you can feed in your own training set. A toy example can be found here: https://github.com/explosion/spaCy/blob/master/examples/training/train_entity_linker.py, from which you can deduce the format of the training data:

def sample_train_data():
    # Each example is a (text, annotation) tuple. The "links" dict maps the
    # (start, end) character offsets of a mention to its candidate QIDs,
    # with gold-standard probabilities (1.0 = the correct entity).
    train_data = []
    # Q2146908 (Russ Cochran): American golfer
    # Q7381115 (Russ Cochran): publisher

    text_1 = "Russ Cochran his reprints include EC Comics."
    dict_1 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_1, {"links": dict_1}))

    text_2 = "Russ Cochran has been publishing comic art."
    dict_2 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_2, {"links": dict_2}))

    text_3 = "Russ Cochran captured his first major title with his son as caddie."
    dict_3 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_3, {"links": dict_3}))

    text_4 = "Russ Cochran was a member of University of Kentucky's golf team."
    dict_4 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_4, {"links": dict_4}))

    return train_data

This example in train_entity_linker.py shows you how the model learns to disambiguate "Russ Cochran" the golfer (Q2146908) from the publisher (Q7381115). Note that it is just a toy example: a realistic application would require a larger knowledge base with accurate prior frequencies (such as you get by running the Wikipedia/Wikidata scripts), and of course you would need many more sentences and more lexical variety for the machine learning model to pick up proper clues and generalize efficiently to unseen text.
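As a rough sketch of how you can restrict the linker to just your entities of interest (this uses the spaCy v2 API; the frequencies, entity vectors, and prior probabilities below are illustrative placeholders, not real values):

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_lg")

# A tiny knowledge base holding only the entities you care about.
# entity_vector_length must match the length of the vectors added below.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])   # golfer
kb.add_entity(entity="Q7381115", freq=19, entity_vector=[-2, 4, 2])   # publisher

# Register the surface form with a prior probability for each candidate.
kb.add_alias(
    alias="Russ Cochran",
    entities=["Q2146908", "Q7381115"],
    probabilities=[0.24, 0.7],
)

# Attach an entity_linker pipe that consults this knowledge base.
entity_linker = nlp.create_pipe("entity_linker")
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)

After training this pipe on data in the format above (see the training loop in train_entity_linker.py), the predicted link shows up on each entity span as ent.kb_id_. Since the knowledge base contains only the entities you added, the linker can only ever choose between those candidates.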

Upvotes: 3
