user9165100

Reputation: 421

spaCy: Word in vocabulary

I'm trying to do typo correction with spaCy, and for that I need to know whether a word exists in the vocab or not. If it doesn't, the idea is to split the word in two until all segments do exist. For example, "ofthe" does not exist, but "of" and "the" do. So I first need to know whether a word exists in the vocab, and that's where the problems start. I try:

for token in nlp("apple"):
    print(token.lemma_, token.lemma, token.is_oov, "apple" in nlp.vocab)
apple 8566208034543834098 True True

for token in nlp("andshy"):
    print(token.lemma_, token.lemma, token.is_oov, "andshy" in nlp.vocab)
andshy 4682930577439079723 True True

Clearly this makes no sense: in both cases is_oov is True, and yet the word tests as being in the vocabulary. I'm looking for something simple like

"andshy" in nlp.vocab = False, "andshy".is_oov = True
"apple" in nlp.vocab = True, "apple".is_oov = False

In the next step I also need some word correction method. I could use the spellchecker library, but that isn't consistent with the spaCy vocab.

This appears to be a frequent problem, and any suggestions (code) are most welcome.
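For reference, the splitting step I have in mind is roughly the sketch below; known() is a hypothetical placeholder for whatever reliable vocabulary check I'm asking about:

def split_word(word, known):
    # Return `word` split into known segments, or None if no such split exists.
    if known(word):
        return [word]
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if known(left):
            rest = split_word(right, known)
            if rest is not None:
                return [left] + rest
    return None

# split_word("ofthe", lambda w: w in {"of", "the"}) -> ["of", "the"]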

thanks,

AHe

Upvotes: 6

Views: 7457

Answers (2)

piernik

Reputation: 307

For spellchecking, you can try spacy_hunspell, which can be added to the nlp pipeline.

More info and sample code are here: https://spacy.io/universe/project/spacy_hunspell
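A minimal sketch based on the sample code from that page (assuming spaCy v2 and hunspell are installed; the "linux"/"mac" argument selects the bundled dictionary paths):

import spacy
from spacy_hunspell import spaCyHunSpell

nlp = spacy.load("en_core_web_sm")
hunspell = spaCyHunSpell(nlp, "linux")  # or "mac", or explicit .dic/.aff paths
nlp.add_pipe(hunspell)

doc = nlp("I can haz cheezeburger.")
haz = doc[2]
print(haz._.hunspell_spell)    # False: flagged as misspelled
print(haz._.hunspell_suggest)  # correction suggestions from hunspell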

Upvotes: 1

aab

Reputation: 11494

Short answer: spaCy's models do not contain any word lists that are suitable for spelling correction.

Longer answer:

spaCy's vocab is not a fixed list of words in a particular language. It is just a cache with lexical information about tokens that have been seen during training and processing. Checking whether a token is in nlp.vocab just checks whether a token is in this cache, so it is not a useful check for spelling correction.
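You can see the cache behavior directly. A minimal sketch (the nonsense token is made up for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")

print("zzzqqq" in nlp.vocab)  # False: not seen yet
nlp("zzzqqq")                 # processing adds a lexeme to the cache
print("zzzqqq" in nlp.vocab)  # True: now cached, though not a real word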

Token.is_oov has a more specific meaning that's not obvious from its short description in the docs: it reports whether the model contains additional lexical information about this token, such as Token.prob. For a small spaCy model like en_core_web_sm, which doesn't contain any probabilities, is_oov will be True for all tokens by default. The md and lg models contain lexical information about 1M+ tokens, and their word vectors cover 600K+ tokens, but these lists are too large and noisy to be useful for spelling correction.
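If you still want a rough in-model check despite that caveat, you can test whether a word has a vector with an md/lg model. A sketch:

import spacy

nlp = spacy.load("en_core_web_md")  # md/lg models ship with vectors

for word in ("apple", "andshy"):
    lex = nlp.vocab[word]            # looks up (and caches) the lexeme
    print(word, lex.has_vector, lex.is_oov)
# expected roughly: apple True False / andshy False True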

Upvotes: 7
