Reputation: 2874
I have seen that there is a paper describing the idea behind Sense2Vec, but how are/were the standard spaCy models created in the first place? When I download something like the standard "en_core_web_md" model from the selection of models, how was it actually created? Are there any papers or spaCy blog posts I can read?
Bonus question: how are the new models in the upcoming spaCy 2.0 so much smaller?
From the version 2 Release summary:
This release features entirely new deep learning-powered models for spaCy's tagger, parser and entity recognizer. The new models are 20x smaller than the linear models that have powered spaCy until now: from 300 MB to only 15 MB.
The only real reference that points in this direction is the passage above from the release summary. An overview of the memory footprints of all models can be found here.
Are only the model weights provided, with every call for the relevant attributes actually computed on the fly? That would explain the slower throughput shown in the benchmarks on this page.
Upvotes: 2
Views: 906
Reputation: 424
If you look at the releases in the models GitHub repo https://github.com/explosion/spacy-models/releases, there are details on each part of the model, e.g. the tagger or parser, stating what data it was trained on and what accuracy the resulting model achieves:
Parser: OntoNotes 5, 91.5% Accuracy
Tagger: OntoNotes 5, 96.9% Accuracy
NER: OntoNotes 5, 84.7% Accuracy
Word vectors: Common Crawl
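If you already have a model package installed, the same metadata is (at least in spaCy 2.x) also exposed on the loaded pipeline via nlp.meta, which is read from the package's meta.json. A minimal sketch; the exact keys, such as the accuracy section, can vary between model versions:

```python
import spacy

# Assumes en_core_web_md is installed as a package.
nlp = spacy.load("en_core_web_md")

# nlp.meta mirrors the model package's meta.json: name, version, sources,
# pipeline components and (in spaCy 2.x models) an accuracy section.
print(nlp.meta["name"], nlp.meta["version"])
print(nlp.meta.get("pipeline"))
print(nlp.meta.get("accuracy"))  # key name/contents may differ by model version
```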
More details on the code needed to train a model can be found here: http://spacy.io/docs/usage/training. There is also source code attached to the releases linked above, but I haven't checked what that code contains.
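For reference, a spaCy 2.x training loop looks roughly like the sketch below (the 1.x API linked above differs in the details). The toy training data, number of iterations and output path are invented for the example:

```python
import random
import spacy

# Toy NER training data in spaCy's (text, annotations) format -- made up
# purely for illustration.
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

nlp = spacy.blank("en")          # start from an empty English pipeline
ner = nlp.create_pipe("ner")     # add a fresh entity recognizer
nlp.add_pipe(ner)
ner.add_label("ORG")

optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)
    print(i, losses)

nlp.to_disk("/tmp/my_ner_model")  # output path chosen for the example
```

Training a model comparable to en_core_web_md of course requires a corpus like OntoNotes 5 rather than a handful of toy sentences.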
Edit:
After reading through the discussion following the announcement of v2.0, I came across an issue that explains how the new NN models work internally.
You can find it here: https://github.com/explosion/spaCy/issues/1057
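Regarding the bonus question about model size: one of the techniques discussed in that issue is hashed ("Bloom") embeddings, where a large vocabulary shares a small, fixed-size embedding table instead of getting one row per word. The snippet below only illustrates that general idea and is not spaCy's actual implementation; the table size, number of hash seeds and hash function are invented for the example:

```python
import numpy as np

N_ROWS, DIM, N_SEEDS = 10000, 64, 4   # made-up sizes for illustration
table = np.random.normal(size=(N_ROWS, DIM)).astype("float32")

def embed(word):
    # Hash the word with several seeds and sum the corresponding rows.
    # Distinct words mostly end up with distinct row combinations, so a
    # small table can serve a very large vocabulary -- which keeps the
    # stored weights far smaller than one vector per word.
    rows = [hash((seed, word)) % N_ROWS for seed in range(N_SEEDS)]
    return table[rows].sum(axis=0)

print(embed("apple").shape)  # (64,)
```

This kind of weight sharing is one reason the neural models can be so much smaller than the older linear models with their large feature weight tables.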
Upvotes: 2