Cheong Sik Feng

Reputation: 121

Gensim Similarity with very large dataset (~4.7 million)

I have a dataset with 4.7 million questions, and I want to compare their tf-idf vectors and retrieve the most similar pair for each question.

According to the gensim documentation,

There is also a special syntax for when you need similarity of documents in the index to the index itself (i.e. queries=indexed documents themselves). This special syntax uses the faster, batch queries internally and is ideal for all-vs-all pairwise similarities:

for similarities in index:  # yield similarities of the 1st indexed document, then 2nd, ...
    pass

However, since I have about 4.7 million documents, each similarities result is a numpy array of length 4.7 million, which is itself very large, and I cannot store all of them in memory.

from gensim.similarities import Similarity

index = Similarity.load('out/corpus.index')

idx1 = 0
for similarities in index:  # <---- this part is slow
    idx1 += 1
    # and other stuff

Is there a way that I can get the most similar pair for each question?

Upvotes: 0

Views: 687

Answers (1)

gojomo

Reputation: 54233

The Similarity class supports splitting the index over multiple shard files on disk, which can all be memory-mapped into addressable space, but that doesn't necessarily mean it's all actually in RAM at once.
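For illustration, here's a minimal sketch of building such a sharded index; the variable texts (a list of tokenized questions) and the shard size are assumptions, not from your question:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import Similarity

# `texts` is assumed to be a list of tokenized questions
dictionary = Dictionary(texts)
tfidf = TfidfModel(dictionary=dictionary)
corpus_tfidf = (tfidf[dictionary.doc2bow(t)] for t in texts)

# the output prefix names the on-disk shard files (out/corpus.index.0, .1, ...);
# shards are memory-mapped as needed rather than all held in RAM together
index = Similarity('out/corpus.index', corpus_tfidf,
                   num_features=len(dictionary), shardsize=131072)
index.save('out/corpus.index')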

However, as @green-cloak-guy notes in his comment, to compare against all docs, you will have to cycle all docs through memory at some point during those calculations. So even if the model doesn't start with them all in RAM, certain operations will pull them all in (though not all at once), and whether that happens via swapping or some other mechanism, it will require a similar amount of I/O.

So if the only symptom you've seen is swapping when you do full-index operations, that will be inherent to any solution, and the Similarity class may already be doing what you want: deferring the loading of index ranges until they're needed.
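For example, here's a sketch of streaming just the single best match per question, using the class's num_best option so each query returns a short list of (doc_id, score) pairs instead of a full 4.7-million-element array (the dict name best_match is my own illustrative choice):

from gensim.similarities import Similarity

index = Similarity.load('out/corpus.index')
index.num_best = 2  # the top hit is normally the document itself, so keep 2

best_match = {}
for idx, sims in enumerate(index):  # still one streaming pass over all shards
    # sims is now a short list of (doc_id, similarity) pairs, best first
    others = [(doc_id, score) for doc_id, score in sims if doc_id != idx]
    if others:
        best_match[idx] = others[0]  # nearest other question and its score

Note that this only bounds memory: the all-vs-all pass is still quadratic in the number of documents, so it will remain slow for 4.7 million questions no matter how the index is stored.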

Upvotes: 1
