Reputation: 121
I have a dataset with 4.7 million questions, and I want to compare their tf-idf vectors and retrieve the most similar pair for each question.
According to the gensim documentation:

There is also a special syntax for when you need similarity of documents in the index to the index itself (i.e. queries = indexed documents themselves). This special syntax uses the faster, batch queries internally and is ideal for all-vs-all pairwise similarities:

for similarities in index:  # yield similarities of the 1st indexed document, then 2nd, ...
    pass
However, since I have about 4.7 million documents, each similarities yielded by that loop is a numpy array of length 4.7 million, which is also very large and which I cannot store in memory.
index = Similarity.load('out/corpus.index')

idx1 = 0
for similarities in index:  # <---- this part is slow
    idx1 += 1
    # and other stuff
Is there a way that I can get the most similar pair for each question?
Upvotes: 0
Views: 687
Reputation: 54233
The Similarity class seems to have support for splitting the index over multiple files on disk, which might all be memory-mapped into addressable space, but that won't necessarily mean it's actually all in RAM at once.
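For reference, here is a minimal sketch of how such a sharded index is typically built. The file paths and variable names below (out/questions.dict, out/questions_bow.mm, etc.) are placeholders, not taken from the question:

from gensim import corpora, models, similarities

# Placeholder inputs: a saved dictionary and a bag-of-words corpus on disk.
dictionary = corpora.Dictionary.load('out/questions.dict')
corpus = corpora.MmCorpus('out/questions_bow.mm')
tfidf = models.TfidfModel(dictionary=dictionary)

# Similarity writes the index as a series of shard files named after the prefix,
# and loads shards on demand rather than holding everything in RAM at once.
index = similarities.Similarity('out/corpus.index',
                                tfidf[corpus],
                                num_features=len(dictionary),
                                shardsize=32768)  # documents per shard
index.save('out/corpus.index')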
However, as @green-cloak-guy notes in his comment, to compare against all docs you will have to cycle all docs into memory from disk at least once during the course of those calculations. So, even if the model doesn't start with them all in RAM, doing certain operations will result in them all being brought in (though not all at once). And whether that happens via swapping or some other mechanism, it will require similar amounts of I/O.
So if the only symptom you've seen is swapping when you do full-index operations, that will be inherent to any solution, and the Similarity class may already be doing what you request: deferring the loading of ranges until they're needed.
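To illustrate that, here is a rough sketch (untested; the path and names are assumed from the question) that still streams every similarity row through memory but only keeps the single best non-self match per document, so what you retain stays linear in the number of documents rather than quadratic:

import numpy as np
from gensim.similarities import Similarity

index = Similarity.load('out/corpus.index')

best_match = []  # (most_similar_doc_id, similarity) for each document, in order
for doc_id, similarities in enumerate(index):
    sims = np.array(similarities)   # copy the row so we can mask it
    sims[doc_id] = -1.0             # ignore self-similarity
    other = int(np.argmax(sims))    # best remaining match for this document
    best_match.append((other, float(sims[other])))

Alternatively, setting num_best on the index (e.g. index.num_best = 2) should make each yielded result contain only the top matches as (doc_id, similarity) pairs instead of a full dense row; you would then just drop the self-match.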
Upvotes: 1