Reputation: 1357
I am using the example from here: https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking.
There is a flag for using descriptions from Wikipedia instead of Wikidata. I set this to True (it should get descriptions from the Wikipedia data). But looking at the code under the Wikidata section,
if not descr_from_wp:
    logger.info("STEP 4c: Writing Wikidata entity descriptions to {}".format(entity_descr_path))
    io.write_id_to_descr(entity_descr_path, id_to_descr)
This should not run because the if statement is False. But under the Wikipedia section,
if descr_from_wp:
    logger.info("STEP 5b: Parsing and writing Wikipedia descriptions to {}".format(entity_descr_path))
It just logs something -- it doesn't actually seem to create the descriptions. And the output file only has the header: WD_id|description.
How can I get it to write the Wikipedia descriptions?
Upvotes: 1
Views: 821
Reputation: 16394
I believe all the action happens in the line before the one you quoted:
wp.create_training_and_desc(wp_xml, entity_defs_path, entity_descr_path,
                            training_entities_path, descr_from_wp, limit_train)
(this is https://github.com/explosion/projects/blob/master/nel-wikipedia/wikidata_pretrain_kb.py#L142)
That function is one file over, at https://github.com/explosion/projects/blob/master/nel-wikipedia/wikidata_processor.py#L176:
def create_training_and_desc(
    wp_input, def_input, desc_output, training_output, parse_desc, limit=None
):
    wp_to_id = io.read_title_to_id(def_input)
    _process_wikipedia_texts(
        wp_input, wp_to_id, desc_output, training_output, parse_desc, limit
    )
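In other words, the descr_from_wp flag you set ends up here as parse_desc, and the Wikipedia descriptions are (or should be) written to desc_output inside _process_wikipedia_texts while the XML dump is parsed. As a rough illustration -- my own sketch using the variable names from wikidata_pretrain_kb.py, not code from the repo -- the call that should populate your descriptions file amounts to:
# Sketch only: forcing the Wikipedia-description path by hand, with the same
# names as in wikidata_pretrain_kb.py. parse_desc=True is what descr_from_wp
# becomes inside create_training_and_desc, and desc_output is the file that
# was coming out with nothing but the WD_id|description header.
wp.create_training_and_desc(
    wp_input=wp_xml,                          # path to the Wikipedia XML dump
    def_input=entity_defs_path,               # entity definitions written in an earlier step
    desc_output=entity_descr_path,            # the descriptions file in question
    training_output=training_entities_path,
    parse_desc=True,                          # i.e. descr_from_wp=True
    limit=None,                               # or a small number for a quick test run
)
If that runs with parse_desc=True and the file still only contains the header, the problem is inside _process_wikipedia_texts (or with the dump or limit you're using) rather than in the step 4c/5b branching you quoted.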
That being said, having gone through this process a few days ago, I did get the impression that it's all in flux and that there may be a bit of a mismatch between the descriptions, the actual code, and versions of spaCy. You may have noticed that the Readme starts with the instruction "Run wikipedia_pretrain_kb.py". And yet, such a file does not exist, only wikidata_pretrain_kb.py.
While the process did work (eventually), the final training progresses at a glacial speed of 10 seconds per example. For 300,000 examples in the training set, that would imply about a year of training at the default 10 epochs.
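For the record, that "about a year" figure is just straightforward arithmetic:
# Back-of-envelope check of the estimate above.
seconds_per_example = 10
examples = 300_000
epochs = 10
days = seconds_per_example * examples * epochs / 86_400  # seconds in a day
print(round(days))  # ~347 days, i.e. roughly a year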
There are some instructions that suggest one isn't intended to run all the training data that's available. But in that case it seems strange to run 10 epochs over a repeating set of data with diminishing returns.
(Updated URLs Nov 2020. This example did not make it over from v2 -> v3 (yet?))
Upvotes: 3