Reputation: 799
I am using gensim to analyze document similarity in a large corpus. Each document has a "title", or more specifically, a unique ID string, along with the content text.
After looking through several tutorials about topic modeling, indexing and retrieval, and Wikipedia, what is still not clear to me is how to get interpretable results after building the LSI model and querying the index with some search vector. After I see the top N most similar document indexes and their similarity scores, how do I look up the titles of those documents?
For example, in this code:
index.num_best = 10
print(index[query_lsi])
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.0
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.1
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.2
[(4028, 0.82495784759521484), (52384, 0.82495784759521484), (13582, 0.8166358470916748), (61938, 0.8166358470916748), (0, 0.80658835172653198), (48356, 0.80658835172653198), (85, 0.8048851490020752), (48441, 0.8048851490020752), (115, 0.79446637630462646), (48471, 0.79446637630462646)]
How would I look up the title of, for example, document #61938 that came back in the most similar results?
In the previous part of that tutorial, the iter_wiki() function yielded a (title, tokens) tuple. That title is what I want.
Upvotes: 0
Views: 803
Reputation: 2072
The second piece of code you posted uses only precomputed vectors and models (see In[3] and In[4] in the same code). It doesn't use or store the documents or the titles as-is, and hence it's not possible to retrieve the titles of the documents from it.
However, the first piece of code you posted defines and uses the WikiCorpus class, which has a list called titles. You can simply use that list to retrieve the required titles.
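So, basically, this should work for you, where doc_id is the document number returned by the index (for example, 61938) and the index was built from that same corpus in the same order:
wiki_corpus.titles[doc_id]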
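To make the idea concrete, here is a minimal sketch of the lookup in plain Python. It assumes the titles are recorded in the same order the documents are streamed into the index, so that position N in the list is document N in the results. The names iter_docs and titles_for, the titles and the scores are all made up for illustration; they are not part of gensim or the tutorial.

```python
def iter_docs():
    """Hypothetical stand-in for a reader that yields (title, tokens) pairs."""
    yield "doc-a", ["human", "computer", "interface"]
    yield "doc-b", ["graph", "trees", "minors"]
    yield "doc-c", ["user", "response", "time"]

# Record each title in the order the documents are streamed, so that
# list position N corresponds to document number N in the index.
titles = []
for title, tokens in iter_docs():
    titles.append(title)
    # ... here the tokens would be turned into a vector and indexed ...

# Made-up hits in the shape the question shows for index[query_lsi]:
# a list of (document number, similarity score) pairs.
results = [(2, 0.82), (0, 0.81)]

def titles_for(results, titles):
    """Replace each document number with its title, keeping the score."""
    return [(titles[doc_id], score) for doc_id, score in results]

for title, score in titles_for(results, titles):
    print(f"{score:.2f}  {title}")
```

If the titles list is not kept in memory between runs, it would have to be saved alongside the index (for example, pickled) and reloaded before doing the lookup.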
Upvotes: 1