Reputation: 21
This is more of a best/common practices question.
We are using spaCy in our production system. While testing, we often have to download full spaCy models (parser + word vectors), which can be very slow (~30 mins) and frustrating. Perhaps a better strategy would be to create a custom lightweight spaCy model for testing, e.g., with only a 1000-word vocabulary and a smaller parsing model.
Are there suggested strategies/best practices when testing with a large data model that can be applied to this scenario?
Upvotes: 1
Views: 589
Reputation: 707
Even though it seems that @Rajhans's problem has already been solved by @aniav's proposal, and mocks and caching are probably a good idea in most cases, I would like to add something that helped me reduce unit test duration:
I realized that I was loading several spaCy components that I wasn't even using. For example, spaCy might load the NER component even though you never use it. You can disable individual components with
nlp = spacy.load("en_core_web_lg", disable=["tagger", "ner"])
which would disable the tagger and the named entity recognizer (NER). See the spaCy documentation for more details.
This not only decreases your unittest duration but has the nice side effect of also making your production code start up faster.
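A related trick, sketched here under assumptions rather than taken from our code base: load the pipeline lazily and only once per process, so that importing your module stays cheap and every test shares the same object. The helper name `get_nlp` and its default arguments are hypothetical; only `spacy.load(..., disable=[...])` is the call shown above.

```python
import functools
import importlib


@functools.lru_cache(maxsize=None)
def get_nlp(model_name="en_core_web_lg", disabled=("tagger", "ner")):
    """Hypothetical helper: load the pipeline on first use, then reuse it.

    `disabled` is a tuple (not a list) so the arguments are hashable
    and can be used as an lru_cache key.
    """
    # Import lazily so that importing this module does not pull in spaCy
    # or trigger any model loading.
    spacy = importlib.import_module("spacy")
    return spacy.load(model_name, disable=list(disabled))
```

Code that needs the pipeline calls `get_nlp()` instead of loading a model at import time, so tests that never touch it pay nothing, and tests that do pay the loading cost once.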
Upvotes: 1
Reputation: 1796
This basically depends on what you need to test and how. You probably don't need or want to test spaCy itself; you want to test your functions that rely on spaCy's results. A good practice here is to mock the responses from spaCy and test your code, trusting that spaCy works properly (it does have tests ;)). In our environment, the models are loaded when spacy is imported, so we had to mock the imported module to avoid loading that data.
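To illustrate the mocking approach, here is a minimal sketch using only the standard library. The function `extract_person_names` and the helper `make_fake_nlp` are made up for this example; the only spaCy behaviour assumed is that calling the pipeline returns a doc with an `ents` attribute whose items expose `text` and `label_`.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def extract_person_names(nlp, text):
    """Hypothetical function under test: return the text of PERSON entities."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


def make_fake_nlp(entities):
    """Build a stand-in for a loaded pipeline that always returns fixed entities."""
    fake_doc = SimpleNamespace(
        ents=[SimpleNamespace(text=text, label_=label) for text, label in entities]
    )
    # Calling the mock with any text returns the canned doc.
    return MagicMock(return_value=fake_doc)


# No model is downloaded or loaded; the test only exercises our own logic.
fake_nlp = make_fake_nlp([("Ada Lovelace", "PERSON"), ("London", "GPE")])
names = extract_person_names(fake_nlp, "Ada Lovelace lived in London.")
```

Passing the pipeline in as an argument is a design choice that makes this kind of substitution easy; if your code loads the model at module level instead, you would need to patch the module, as described above.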
There is of course the option of creating lightweight versions of the models, but this is not trivial: it would probably require work on each spaCy version change, and you have to keep in mind that other developers should be able to update the models afterwards when tests or requirements change.
If you do in fact need the models and the biggest problem is waiting for them to download, consider caching the data. Many CI environments can cache the models for you, and the cache will remain valid until a newer version of spaCy is introduced.
Upvotes: 2