Reputation: 1035
I tried to see the number of words in the vocabulary of the spaCy small model:
import spacy

model_name = "en_core_web_sm"
nlpp = spacy.load(model_name)
len(list(nlpp.vocab.strings))
which gave me only 1185 words. I also tried this on my colleagues' machines and got different results (1198 and 1183).
Is it normal for the model to have such a small vocabulary for part-of-speech tagging? When I use it on my dataset, I lose a lot of words. Also, why does the number of words vary across machines?
Thanks!
Upvotes: 3
Views: 1098
Reputation: 2560
The vocabulary is loaded dynamically, so the StringStore does not contain all of the words when you first load the model. You can see this if you try the following:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> len(nlp.vocab.strings)
1180
>>> 'lawyer' in nlp.vocab.strings
False
>>> doc = nlp('I am a lawyer')
>>> 'lawyer' in nlp.vocab.strings
True
>>> len(nlp.vocab.strings)
1182
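To make the dynamic growth a bit more concrete, here is a small sketch (assuming spaCy 2.x and the small English model; the sample sentences are just made-up text) that processes a few documents and watches the StringStore grow:
import spacy

nlp = spacy.load('en_core_web_sm')
print(len(nlp.vocab.strings))  # small initial string table

# Hypothetical sample texts; any corpus would show the same effect.
texts = [
    'The lawyer signed the contract.',
    'Astronauts repaired the telescope.',
    'She debugged the parser overnight.',
]

# Each processed Doc adds previously unseen strings to the StringStore.
for doc in nlp.pipe(texts):
    print(len(nlp.vocab.strings))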
It's probably easiest to load the vocabulary straight from the raw file, like this:
>>> import json
>>> fn = '/usr/local/lib/python3.6/dist-packages/spacy/data/en/en_core_web_sm-2.0.0/vocab/strings.json'
>>> with open(fn) as f:
...     strings = json.load(f)
>>> len(strings)
78930
Note that the above file location is for Ubuntu 18.04. If you're on Windows, there will be a similar file in a different location.
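If you'd rather not hard-code the path at all, one option (a sketch, assuming spaCy 2.x and that en_core_web_sm was installed as a regular pip package rather than linked into spacy/data; the exact layout can differ between versions) is to locate the installed model package and search it for strings.json:
import json
from pathlib import Path

import en_core_web_sm  # the model installed as a Python package

# The package directory contains the versioned model data folder,
# e.g. en_core_web_sm-2.0.0/vocab/strings.json in spaCy 2.x.
package_dir = Path(en_core_web_sm.__file__).parent
strings_path = next(package_dir.glob('**/strings.json'))

with open(strings_path) as f:
    strings = json.load(f)

print(len(strings))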
Upvotes: 3