eng2019

Reputation: 1035

Size of the vocabulary in the spaCy model 'en_core_web_sm'

I tried to check the number of words in the vocabulary of the spaCy small model:

import spacy

model_name = "en_core_web_sm"

nlpp = spacy.load(model_name)

len(list(nlpp.vocab.strings))

which only gave me 1185 words. I also tried it on my colleagues' machines and got different results (1198 and 1183).

Is it supposed to have only such a small vocabulary for training Part-Of-Speech tagging? When I use this on my dataset, I lose a lot of words. Why does the number of words vary across different machines?

Thanks!

Upvotes: 3

Views: 1098

Answers (1)

bivouac0

Reputation: 2560

The vocabulary is dynamically loaded, so you don't have all the words in the StringStore when you first load the vocab. You can see this if you try the following...

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> len(nlp.vocab.strings)
1180
>>> 'lawyer' in nlp.vocab.strings
False
>>> doc = nlp('I am a lawyer')
>>> 'lawyer' in nlp.vocab.strings
True
>>> len(nlp.vocab.strings)
1182

That also explains why you and your colleagues see slightly different counts: the StringStore grows as each pipeline processes text. It's probably easiest to simply load the full vocabulary from the raw file, like this...

>>> import json
>>> fn = '/usr/local/lib/python3.6/dist-packages/spacy/data/en/en_core_web_sm-2.0.0/vocab/strings.json'
>>> with open(fn) as f:
...     strings = json.load(f)
>>> len(strings)
78930

Note that the above file location is for Ubuntu 18.04. If you're on Windows there will be a similar file but in a different location.
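If you'd rather not hardcode the install path, you can usually locate the file through the model package itself. The following is just a sketch that assumes a pip-installed model and the spaCy 2.x package layout, where the data sits in a versioned subfolder containing vocab/strings.json (both of those are assumptions, so adjust the glob pattern if your layout differs)...

>>> import json
>>> from pathlib import Path
>>> import en_core_web_sm  # the installed model package
>>> pkg_dir = Path(en_core_web_sm.__file__).parent
>>> # assumes the 2.x layout: <package>/<name>-<version>/vocab/strings.json
>>> fn = next(pkg_dir.glob('*/vocab/strings.json'))
>>> with open(fn) as f:
...     strings = json.load(f)
>>> len(strings)  # full on-disk vocabulary size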

Upvotes: 3
