Avatrin

Reputation: 187

Dealing with new words in gensim not found in model

Let's say I am trying to compute the average distance between a word and a document using distances(), or the cosine similarity between two documents using n_similarity(). However, suppose these new documents contain words that the original model did not. How does gensim deal with that?

I have been reading through the documentation and cannot find what gensim does with unfound words.

I would prefer that gensim not count those words towards the average. So, in the case of distances(), it should either return nothing for them or return something I can easily delete later, before I compute the mean using numpy. In the case of n_similarity(), gensim of course has to handle that by itself.

I am asking because the documents and words that my program will have to classify will in some instances contain unknown words, names, brands, etc. that I do not want taken into consideration during classification. So, I want to know whether I'll have to preprocess every document that I am trying to classify.

Upvotes: 1

Views: 275

Answers (2)

Serge

Reputation: 3765

The models are defined on vectors which, with default settings, depend only on the old words seen in training, so I do not expect them to depend on new words.

It is still possible, depending on the code, for new words to affect results. To be on the safe side, I recommend testing your particular model and/or metrics on a small text (with and without a bunch of new words).

Upvotes: 1

gojomo

Reputation: 54223

Depending on the context, Gensim will usually either ignore unknown words, or throw an error like KeyError when an exact-word lookup fails. (Also, some word-vector models, like FastText, can synthesize better-than-nothing guesswork vectors for unknown words based on word-fragments observed during training.)

You should try your desired operations with the specific models/method of interest to observe the results.

If operation-interrupting errors are thrown and are a problem for your code, you could pre-filter your lists of words to remove those not also present in the model.
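The pre-filtering idea can be sketched roughly as follows. This is a minimal illustration, not gensim's own behavior: filter_known is a hypothetical helper name, and the vocab set is a stand-in for a real model's vocabulary so the snippet runs without a trained model.

```python
def filter_known(words, vocab):
    """Return only the words that `vocab` contains, preserving order.

    `vocab` can be any container that supports the `in` operator.
    (Hypothetical helper for illustration; not part of gensim.)
    """
    return [w for w in words if w in vocab]


# Stand-in vocabulary (an assumption made for this example only).
vocab = {"cat", "dog", "car"}

doc_a = ["cat", "dog", "acmebrand"]   # "acmebrand" is not in the vocabulary
doc_b = ["car", "zzyzx"]              # "zzyzx" is not in the vocabulary

known_a = filter_known(doc_a, vocab)
known_b = filter_known(doc_b, vocab)

# Only compare the documents if both still contain at least one known word;
# an empty list leaves nothing to average over.
can_compare = bool(known_a) and bool(known_b)

print(known_a, known_b, can_compare)
```

With a real model you would presumably pass the word-vector object itself as vocab (for example model.wv, assuming a recent gensim version where that object supports `word in kv`), and then hand the filtered lists to n_similarity() or distances(). Check this against the gensim version you have installed, and keep the empty-list guard, since a document made up entirely of unknown words may otherwise cause an error.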

Upvotes: 2
