gensim document similarity: how to get document titles from most similar results?

Question

I am using gensim to analyze document similarity in a large corpus. Each document has a "title", or more specifically, a unique ID string, along with the content text.

After looking through several tutorials about topic modeling, indexing and retrieval, and Wikipedia, what is still not clear to me is how to get interpretable results after building the LSI model and querying the index with some search vector. Once I see the top N most similar document indexes and their similarity scores, how do I look up the titles of those documents?

For example, in this code:

index.num_best = 10
print(index[query_lsi])
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.0
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.1
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.2

[(4028, 0.82495784759521484), (52384, 0.82495784759521484), (13582, 0.8166358470916748), (61938, 0.8166358470916748), (0, 0.80658835172653198), (48356, 0.80658835172653198), (85, 0.8048851490020752), (48441, 0.8048851490020752), (115, 0.79446637630462646), (48471, 0.79446637630462646)]

How would I look up the title of, for example, document #61938, which came back in the most similar results?

In the previous part of that tutorial, the iter_wiki() function yielded (title, tokens) tuples. That title is what I want.
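
To make the goal concrete, here is a minimal sketch of the kind of lookup I have in mind. It assumes the document numbers returned by the index are simply the positions of the documents in the order they were fed to the corpus, so a list of titles collected in that same order can be indexed by them. The names iter_docs and titles_for, the sample titles, and the sample results are hypothetical stand-ins, not part of gensim or the tutorial.

```python
def iter_docs():
    """Hypothetical stand-in for iter_wiki(): yields (title, tokens) pairs."""
    yield "Anarchism", ["political", "philosophy"]
    yield "Autism", ["neurodevelopmental", "condition"]
    yield "Albedo", ["reflection", "coefficient"]


# Collect titles in the same order the documents are streamed into the
# corpus, so that titles[i] belongs to document number i (assumption).
titles = [title for title, _tokens in iter_docs()]


def titles_for(results, titles):
    """Turn (doc_id, score) pairs into (title, score) pairs."""
    return [(titles[doc_id], score) for doc_id, score in results]


# Made-up result in the same shape as the index[query_lsi] output above.
results = [(2, 0.82), (0, 0.80)]
print(titles_for(results, titles))
```

For a corpus the size of Wikipedia the list of titles would need to be built once while streaming the dump and saved next to the index, so the two stay in the same order.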
