Reputation: 1904
I've implemented an inverted index in Python. It is essentially a dictionary whose keys are the words in the corpus and whose values are lists of tuples, each pairing a document the word occurs in with its BM25 score for that document.
{
"love": [(doc1, 12), (doc3, 7.9), (doc5, 6.5)],
"hate": [(doc2, 8.7), (doc4, 3.2)]
}
However, when I process a query, I find it hard to benefit from the efficiency of the inverted index, because I must iterate over all the words in the query in a for loop. Within this loop, I must further loop over the documents each word links to and maintain a global score table for all documents.
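For reference, the nested loop described above can be sketched roughly as follows. This is a minimal illustration, not the exact code: the function name `score_query`, the string document IDs, and the choice to sum BM25 scores per document are assumptions.

```python
from collections import defaultdict

# Toy index in the shape shown above; string document IDs are an assumption.
index = {
    "love": [("doc1", 12.0), ("doc3", 7.9), ("doc5", 6.5)],
    "hate": [("doc2", 8.7), ("doc4", 3.2)],
}

def score_query(index, query_terms):
    """Term-at-a-time scoring: add up each document's BM25 contributions."""
    scores = defaultdict(float)  # the global score table
    for term in query_terms:
        # Words missing from the index simply contribute nothing.
        for doc_id, bm25 in index.get(term, ()):
            scores[doc_id] += bm25
    # Rank documents by total score, highest first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = score_query(index, ["love", "hate", "unknown"])
```

Only the posting lists of the query words are visited, so the cost grows with the length of those lists rather than with the size of the whole corpus.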
I don't think this is the optimal way. Any ideas for speeding it up? I think a batch dictionary that accepts multiple keys and returns multiple values in parallel would help.
Upvotes: 1
Views: 410
Reputation: 11
It should be more efficient to represent the inverted index as a matrix, in particular a sparse matrix, where the rows are the words in your corpus and the columns are the documents.
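A minimal sketch of that idea, assuming SciPy and NumPy are available and reusing the toy index from the question. The names `term_row`, `doc_col`, and `score_query_sparse` are made up for illustration. With words as rows and documents as columns, scoring a query becomes one sparse matrix-vector product instead of nested Python loops.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy index from the question; string document IDs are an assumption.
index = {
    "love": [("doc1", 12.0), ("doc3", 7.9), ("doc5", 6.5)],
    "hate": [("doc2", 8.7), ("doc4", 3.2)],
}

# Assign each word a row and each document a column.
terms = sorted(index)
docs = sorted({doc for postings in index.values() for doc, _ in postings})
term_row = {term: i for i, term in enumerate(terms)}
doc_col = {doc: j for j, doc in enumerate(docs)}

# Collect (row, column, value) triples and build the sparse matrix once.
rows, cols, vals = [], [], []
for term, postings in index.items():
    for doc, bm25 in postings:
        rows.append(term_row[term])
        cols.append(doc_col[doc])
        vals.append(bm25)
matrix = csr_matrix((vals, (rows, cols)), shape=(len(terms), len(docs)))

def score_query_sparse(query_terms):
    """Score all documents with a single sparse matrix-vector product."""
    q = np.zeros(len(terms))
    for term in query_terms:
        if term in term_row:  # ignore words that are not in the index
            q[term_row[term]] += 1.0
    totals = matrix.T.dot(q)  # one summed score per document column
    return dict(zip(docs, totals))

scores = score_query_sparse(["love", "hate"])
```

Whether this beats the dictionary loop depends on corpus size and query length, so it is worth timing both on real data.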
Upvotes: 1