Reputation: 53
I'm still learning Python and model creation, and I'm very new to NLP with spaCy. I followed https://spacy.io/usage/training#ner to train spaCy's existing en_core_web_sm model with my domain-specific entities.
def main(model="en_core_web_sm", new_model_name="new_ner_model", output_dir='/content/drive/My Drive/Data/new_model', n_iter=100):
    .
    .
    (code to train the model)
    .
    .
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
Now I assumed that I would find a single model file in the output directory. Instead, what I have are 4 subfolders (vocab, ner, tagger, parser) and 2 files (meta.json and tokenizer). The ner subfolder contains cfg, moves, and model.
According to the website mentioned above, to load the new model, I need to use the entire folder (output directory), i.e.
nlp2 = spacy.load(output_dir)
Is the whole directory needed (is that the model) or is it the binary file named model within the ner subfolder?
Upvotes: 1
Views: 1427
Reputation: 3106
In general, we advise saving the entire model as a folder, to make sure everything is loaded back in consistently. It won't work to load just the model file by itself: it contains only the weights of the neural network. The other files are needed to define the parameters and setup of your NLP pipeline and its different components. For instance, you always need the vocab data.
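To illustrate the round trip, here is a minimal sketch using a blank English pipeline (so nothing needs to be downloaded); the output path and model name are hypothetical stand-ins for your own:

```python
import tempfile
from pathlib import Path

import spacy

# A blank pipeline stands in for the trained model here.
nlp = spacy.blank("en")
nlp.meta["name"] = "new_ner_model"  # rename, as in the training script

output_dir = Path(tempfile.mkdtemp()) / "new_model"
output_dir.mkdir(parents=True)
nlp.to_disk(output_dir)

# spacy.load() expects the whole directory; the binary "model" file inside a
# component's subfolder holds only that component's weights.
nlp2 = spacy.load(output_dir)
print(nlp2.meta["name"])  # the renamed model comes back intact
```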
One thing you could do is disable the components you're not interested in. This will decrease the folder size on disk and drop the subfolders you don't need. For instance, if you're only interested in the NER, you could do:
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
Or, if you loaded the whole model, you could store just parts of it to disk:
nlp.to_disk(output_dir, exclude=["parser", "tagger"])
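A quick sketch of that trimmed save: exclude= skips the named components during serialization, so their subfolders are simply never written, and the directory still loads back as a complete pipeline. A blank pipeline stands in for the trained model, and the path is hypothetical:

```python
import tempfile
from pathlib import Path

import spacy

# Stand-in for the loaded model; names in exclude= that aren't present are
# silently skipped during serialization.
nlp = spacy.blank("en")

trimmed_dir = Path(tempfile.mkdtemp()) / "trimmed_model"
trimmed_dir.mkdir(parents=True)
nlp.to_disk(trimmed_dir, exclude=["parser", "tagger"])

# No parser/tagger subfolders were written, yet the directory loads fine.
nlp2 = spacy.load(trimmed_dir)
print(sorted(p.name for p in trimmed_dir.iterdir()))
```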
Upvotes: 1