Least Similar with Gensim Doc2Vec

Question

The most_similar method finds the top-N most similar words.

Is there a method or a way to find the N least similar words?

gojomo · Accepted Answer

You could get the full ranked list of all vectors by similarity, using a topn parameter as large as the full set of vectors. Then look at just the last N. For example:

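import sys

# Ask for every vector, ranked from most to least similar to target_value.
all_sims = vec_model.most_similar(target_value, topn=sys.maxsize)
# Keep the last 10 entries, reversed so the least similar comes first.
last_10 = list(reversed(all_sims[-10:]))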

However, note:

- This requires a bit more sorting, and momentarily needs a lot more memory, because the full list is returned before it is trimmed to the last few entries.
- These results are unlikely to be especially meaningful to human perception, whether as words or documents. That is, the least similar item is unlikely to be a word's or document's 'opposite' in the senses we perceive. Such opposites, or indeed any words/docs that are interestingly contrasted with an origin point, will usually be quite close to the origin in the high-dimensional space, just shifted in some meaningful way. (For example, a word's antonyms are far closer to the word than the most-dissimilar words this approach will find.)
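
If the extra sorting and memory are a concern, one possible alternative is to score every vector once and keep only the N lowest scores with a bounded heap, rather than materializing the full ranked list. The following is a minimal, library-independent sketch of that idea, not gensim's own API: the `least_similar` helper, the `toy_vectors` dict, and its 2-D numbers are all hypothetical and invented purely for illustration. With a real model you would have to pull the vectors out of the model yourself.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def least_similar(vectors, target_key, n=10):
    """Return the n (key, similarity) pairs least similar to target_key.

    Hypothetical helper: `vectors` is assumed to be a plain dict mapping
    each key to its vector. heapq.nsmallest keeps only n candidates at a
    time, so the full ranked list is never built.
    """
    target = vectors[target_key]
    scored = (
        (key, cosine(target, vec))
        for key, vec in vectors.items()
        if key != target_key
    )
    return heapq.nsmallest(n, scored, key=lambda pair: pair[1])


# Made-up 2-D vectors, for illustration only (not from any trained model).
toy_vectors = {
    "hot": [1.0, 0.2],
    "warm": [0.9, 0.3],
    "cold": [0.8, -0.4],
    "zebra": [-1.0, 0.1],
    "gravel": [-0.7, -0.6],
}

result = least_similar(toy_vectors, "hot", n=2)
for key, sim in result:
    print(key, round(sim, 3))
```

In this toy data, "cold" was deliberately placed near "hot", so it does not show up among the least similar keys. That mirrors the second caveat above: the far end of the ranking tends to hold unrelated items, not opposites.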
