Reputation: 187
Let's say I am trying to compute the average distance between a word and a document using distances(), or the cosine similarity between two documents using n_similarity(). However, let's say these new documents contain words that the original model did not see. How does gensim deal with that?
I have been reading through the documentation and cannot find what gensim does with unfound words.
I would prefer that gensim not count those words towards the average. So, in the case of distances(), it should simply return nothing for them, or return something I can easily delete later before I compute the mean using numpy. In the case of n_similarity(), gensim of course has to handle it by itself.
I am asking because the documents and words that my program will have to classify will in some instances contain unknown words, names, brands, etc. that I do not want to be taken into consideration during classification. So, I want to know if I'll have to preprocess every document that I am trying to classify.
Upvotes: 1
Views: 275
Reputation: 3765
The models are defined on vectors, which, by default, depend only on the words seen during training, so I do not expect them to depend on new words.
It is still possible, depending on the code, for new words to affect results. To be on the safe side, I recommend testing your particular model and/or metrics on a small text (with and without a bunch of new words).
Upvotes: 1
Reputation: 54223
Depending on the context, Gensim will usually either ignore unknown words or throw an error like KeyError when an exact-word lookup fails. (Also, some word-vector models, like FastText, can synthesize better-than-nothing guesswork vectors for unknown words based on word-fragments observed during training.)
You should try your desired operations with the specific models/method of interest to observe the results.
If operation-interrupting errors are thrown and are a problem for your code, you could pre-filter your lists-of-words to remove those not also present in the model.
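A minimal sketch of that pre-filtering step, assuming only that the vocabulary object supports the `in` operator. The helper name `filter_known` and the toy data are made up for illustration; a plain dict stands in for the trained word vectors so the snippet runs without a model. Gensim's word-vector objects (e.g. `model.wv`) should support the same `in` check, but verify that against your installed version.

```python
def filter_known(words, vocab):
    """Return only the words that `vocab` contains, preserving order.

    `vocab` can be any object that supports `word in vocab`
    (here a dict; assumed to also work with a model's word vectors).
    """
    return [w for w in words if w in vocab]


# Toy stand-in for a trained model's vocabulary (hypothetical data).
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
}

doc = ["cat", "acme_brand", "dog"]  # "acme_brand" is unknown to the toy vocab
known = filter_known(doc, toy_vectors)
print(known)  # ['cat', 'dog']
```

The filtered lists can then be passed to the similarity or distance call of interest. If a document contains no known words at all, the filtered list is empty, so it is worth checking for that case before computing anything.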
Upvotes: 2