rakael

Reputation: 615

How to store spaCy Doc objects and reload them correctly?

I have about 90 documents that I have processed with spaCy.

import os
import spacy

nlp = spacy.load('de')
# enumerate gives each document its own file name (1, 2, 3, ...)
for index, document in enumerate(doc_collection, start=1):
    doc = nlp(document)
    doc.to_disk('doc_folder/' + str(index))

This seems to work fine. Later, I want to reload the stored doc files through a generator.

def get_spacy_doc_list():
    for file in os.listdir('doc_folder'):
        filename = os.fsdecode(file)

        yield spacy.tokens.Doc(spacy.vocab.Vocab()).from_disk('doc_folder/' + filename)


for doc in get_spacy_doc_list():
    for token in doc:
        print(token.lemma_)

When I run this, I get the following error:

KeyError: "[E018] Can't retrieve string for hash '12397158900972795331'."

How can I store and load spaCy Doc objects without getting this error? Thanks for your help!

Upvotes: 1

Views: 1730

Answers (1)

rakael

Reputation: 615

Found the solution. The problem is this line:

yield spacy.tokens.Doc(spacy.vocab.Vocab()).from_disk('doc_folder/' + filename)

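The Vocab instance has to be the one that belongs to your nlp object. A freshly created Vocab() has no entries for the strings of your documents, so the hashes stored in the doc cannot be mapped back to text. Pass nlp.vocab instead: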

yield spacy.tokens.Doc(nlp.vocab).from_disk('doc_folder/' + filename)
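
For completeness, here is a minimal sketch of the whole round trip with the fix applied. The helper names save_docs and load_docs are my own and not part of spaCy, and the sketch assumes the folder contains only the numbered files written by save_docs. The important part is that the same nlp object (and therefore the same vocab) is used for saving and loading.

```python
import os

from spacy.tokens import Doc


def save_docs(nlp, texts, folder):
    """Process each text and write the resulting Doc to folder/<index>."""
    # Hypothetical helper: file names are simply 1, 2, 3, ...
    os.makedirs(folder, exist_ok=True)
    for index, text in enumerate(texts, start=1):
        nlp(text).to_disk(os.path.join(folder, str(index)))


def load_docs(nlp, folder):
    """Yield the stored Docs, reusing the vocab of the given pipeline."""
    # Assumes every file name in the folder is a number written by save_docs.
    for filename in sorted(os.listdir(folder), key=int):
        # nlp.vocab already knows the strings, so the hashes can be resolved.
        yield Doc(nlp.vocab).from_disk(os.path.join(folder, filename))
```

With the setup from the question this would be called as save_docs(nlp, doc_collection, 'doc_folder'), followed later by iterating over load_docs(nlp, 'doc_folder').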

Upvotes: 4
