Cheong Sik Feng

Reputation: 121

Gensim Similarity with very large dataset (~4.7 million)

I have a dataset with 4.7 million questions, and I want to compare their tf-idf vectors and retrieve the most similar pair for each question.

According to the gensim documentation,

There is also a special syntax for when you need similarity of documents in the index to the index itself (i.e. queries=indexed documents themselves). This special syntax uses the faster, batch queries internally and is ideal for all-vs-all pairwise similarities:

for similarities in index:  # yield similarities of the 1st indexed document, then 2nd, ...
    pass

However, since I have about 4.7 million documents, each similarities result is a numpy array of length 4.7 million, which is itself very large, and I cannot store all of them in memory.

from gensim.similarities import Similarity

index = Similarity.load('out/corpus.index')

idx1 = 0
for similarities in index:  # <---- this part is slow
    idx1 += 1
    # and other stuff

Is there a way that I can get the most similar pair for each question?

Upvotes: 0

Views: 687

Answers (1)

gojomo

Reputation: 54233

The Similarity class supports splitting the index over multiple shard files on disk, which can all be memory-mapped into addressable space, but that doesn't necessarily mean it's all actually in RAM at once.
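For illustration, here's a minimal sketch of building such a sharded index; the variable texts (a list of tokenized questions) and the shard size are assumptions, not from your question:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import Similarity

# `texts` is assumed to be a list of tokenized questions
dictionary = Dictionary(texts)
tfidf = TfidfModel(dictionary=dictionary)
corpus_tfidf = (tfidf[dictionary.doc2bow(t)] for t in texts)

# the output prefix names the on-disk shard files (out/corpus.index.0, .1, ...);
# shards are memory-mapped as needed rather than all held in RAM together
index = Similarity('out/corpus.index', corpus_tfidf,
                   num_features=len(dictionary), shardsize=131072)
index.save('out/corpus.index')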

However, as @green-cloak-guy notes in his comment, to compare against all docs, you will have to cycle all docs through memory at some point during those calculations. So even if the model doesn't start with them all in RAM, certain operations will pull them all in (though not all at once), and whether that happens via swapping or some other mechanism, it will require a similar amount of I/O.

So if the only symptom you've seen is swapping when you do full-index operations, that will be inherent to any solution, and the Similarity class may already be doing what you want: deferring the loading of index ranges until they're needed.
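For example, here's a sketch of streaming just the single best match per question, using the class's num_best option so each query returns a short list of (doc_id, score) pairs instead of a full 4.7-million-element array (the dict name best_match is my own illustrative choice):

from gensim.similarities import Similarity

index = Similarity.load('out/corpus.index')
index.num_best = 2  # the top hit is normally the document itself, so keep 2

best_match = {}
for idx, sims in enumerate(index):  # still one streaming pass over all shards
    # sims is now a short list of (doc_id, similarity) pairs, best first
    others = [(doc_id, score) for doc_id, score in sims if doc_id != idx]
    if others:
        best_match[idx] = others[0]  # nearest other question and its score

Note that this only bounds memory: the all-vs-all pass is still quadratic in the number of documents, so it will remain slow for 4.7 million questions no matter how the index is stored.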

Upvotes: 1
