How to specify word vector for OOV terms in Spacy?

Question

I have a pre-trained word2vec model that I load to spacy to vectorize new words. Given new text I perform nlp('hi').vector to obtain the vector for the word 'hi'.

Eventually, a new word needs to be vectorized which is not present in the vocabulary of my pre-trained model. In this scenario spacy defaults to a vector filled with zeros. I would like to be able to set this default vector for OOV terms.

Example:

import spacy
path_model= '/home/bionlp/spacy.bio_word2vec.model'
nlp=spacy.load(path_spacy)
print(nlp('abcdef').vector, '
',nlp('gene').vector)

This code outputs a dense vector for the word 'gene' and a vector full of 0s for the word 'abcdef' (since it's not present in the vocabulary):

My goal is to be able to specify the vector for missing words, so instead of getting a vector full of 0s for the word 'abcdef' you can get (for instance) a vector full of 1s.

gojomo · Accepted Answer

If you simply want your plug-vector instead of the SpaCy default all-zeros vector, you could just add an extra step where you replace any all-zeros vectors with yours. For example:

words = ['words', 'may', 'by', 'fehlt']
my_oov_vec = ...  # whatever you like
spacy_vecs = [nlp(word) for word in words]
fixed_vecs = [vec if vec.any() else my_oov_vec 
              for vec in spacy_vecs]

I'm not sure why you'd want to do this. Lots of work with word-vectors simply elides out-of-vocabulary words; using any plug value, including SpaCy's zero-vector, may just be adding unhelpful noise.

And if better handling of OOV words is important, note that some other word-vector models, like FastText, can synthesize better-than-nothing guess-vectors for OOV words, by using vectors learned for subword fragments during training. That's similar to how people can often work out the gist of a word from familiar word-roots.

How to specify word vector for OOV terms in Spacy?

Answers (1)

Related Questions