Jeong Kim

Reputation: 526

Least Similar with Gensim Doc2Vec

The most_similar method finds the top-N most similar words.

Is there a method or a way to find the N least similar words?

Upvotes: 1

Views: 1272

Answers (1)

gojomo

Reputation: 54183

You could get the full ranked list of all vectors by similarity, using a topn parameter as large as the full set of vectors. Then look at just the last N. For example:

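import sys
# Ask for every vector, ranked from most to least similar
all_sims = vec_model.most_similar(target_value, topn=sys.maxsize)
# Keep the last 10, reversed so the least similar comes first
last_10 = list(reversed(all_sims[-10:]))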

However, note:

  • This will require a bit more sorting, and will momentarily need a lot more memory, to return the full list before trimming it to the last few

  • These are unlikely to be especially meaningful, as either words or documents, to human perception. That is, it's unlikely to be a word's or document's 'opposite' in the senses we perceive. Such opposites, or indeed any words/docs that are interestingly contrasted with an origin point, are usually going to be quite close to the origin in the high-dimensional space, just shifted in some meaningful way. (For example, a word's antonyms are far closer to the word than the most-dissimilar words this will find.)
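If the extra sorting and memory from the first point are a concern, one alternative is to compute the similarities yourself and keep only the N smallest, rather than ranking everything. The following is a minimal sketch of that idea, not Gensim's own API: `cosine`, `least_similar`, and `toy_vectors` are hypothetical names, and the plain dict of lists stands in for whatever key-to-vector mapping you pull out of your model.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def least_similar(target_key, vectors, n=10):
    """Return the n (key, similarity) pairs least similar to target_key.

    Results are ordered least similar first. heapq.nsmallest keeps only
    n candidates at a time, so the full ranked list is never built.
    """
    target = vectors[target_key]
    sims = (
        (key, cosine(target, vec))
        for key, vec in vectors.items()
        if key != target_key  # skip the target itself
    )
    return heapq.nsmallest(n, sims, key=lambda pair: pair[1])


# Hypothetical toy data standing in for a trained model's vectors
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
result = least_similar("a", toy_vectors, n=2)
print(result)
```

With the toy data, "d" points in the opposite direction from "a" and "c" is orthogonal to it, so those two come back first. The caveat in the second point still applies: the lowest-similarity items are rarely meaningful "opposites".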

Upvotes: 1
