Dealing with new words in gensim not found in model

Question

Let's say I am trying to compute the average distance between a word and a document using distances(), or the cosine similarity between two documents using n_similarity(). However, suppose these new documents contain words that the original model never saw. How does gensim deal with that?

I have been reading through the documentation and cannot find what gensim does with unfound words.

I would prefer that gensim not count those words towards the average. So, in the case of distances(), it should either return nothing for them or return something I can easily delete before I compute the mean using numpy. In the case of n_similarity(), gensim of course has to handle it by itself.

I am asking because the documents and words that my program will have to classify will in some instances contain unknown words, names, brands, etc. that I do not want taken into consideration during classification. So, I want to know whether I'll have to preprocess every document that I am trying to classify.

gojomo · Accepted Answer

Depending on the context, Gensim will usually either ignore unknown words, or throw an error like KeyError when an exact-word lookup fails. (Also, some word-vector models, like FastText, can synthesize better-than-nothing guesswork vectors for unknown words based on word-fragments observed during training.)

You should try your desired operations with the specific models/method of interest to observe the results.

If such errors are thrown and interrupt your code, you could pre-filter your lists of words to remove any that are not also present in the model.
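
As a rough illustration of that pre-filtering step, here is a minimal sketch. The helper name filter_known is hypothetical, and a plain dict stands in for the model's word vectors so the snippet runs without a trained model. It assumes the real vectors object (for example a gensim KeyedVectors, typically reached as model.wv) supports the `in` operator for membership checks, which is worth confirming against your installed version.

```python
def filter_known(words, vocab):
    """Return only the words that `vocab` contains (hypothetical helper)."""
    return [w for w in words if w in vocab]


# Stand-in for a trained model's word vectors; a real KeyedVectors object
# is assumed to answer `word in kv` the same way this dict does.
toy_vectors = {"cat": [0.1, 0.9], "dog": [0.2, 0.8], "car": [0.9, 0.1]}

doc = ["cat", "dog", "Zyxel", "car"]  # "Zyxel" plays the unknown brand name
known = filter_known(doc, toy_vectors)
print(known)

# With a real model the same idea would look roughly like this (not run here):
#   kv = model.wv
#   doc_a = filter_known(doc_a, kv)
#   doc_b = filter_known(doc_b, kv)
#   if doc_a and doc_b:  # skip the call if filtering left a list empty
#       sim = kv.n_similarity(doc_a, doc_b)
```

After filtering, check that each list still has at least one word before passing it on, since a document made up entirely of unknown words leaves nothing to average.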
