rwallace

Reputation: 33365

Displaying topics associated with a document/query in Gensim

Gensim has a tutorial showing how, given a document/query string, to list the other documents most similar to it, in descending order:

http://radimrehurek.com/gensim/tut3.html

It can also display the topics associated with a model as a whole:

How to print the LDA topics models from gensim? Python

But how do you find which topics are associated with a given document/query string, ideally with a numeric weight for each topic? I haven't been able to find anything on that.

Upvotes: 0

Views: 1188

Answers (1)

kethort

Reputation: 399

If you want to find the topic distribution of an unseen document, you first need to convert the document of interest into a bag-of-words representation:

from gensim import utils, models
from gensim.corpora import Dictionary

lda = models.LdaModel.load('saved_lda.model')          # load saved model
dictionary = Dictionary.load('saved_dictionary.dict')  # load saved dictionary
with open('document', 'r') as inp:                     # read the file into a string
    text = inp.read()
tkn_doc = utils.simple_preprocess(text)  # tokenize and filter words
doc_bow = dictionary.doc2bow(tkn_doc)    # use the dictionary to create the bag of words
doc_vec = lda[doc_bow]                   # topic probability distribution for the document of interest

From this code you get a sparse representation: a list of (topic_id, probability) pairs, where each probability is the weight of that topic in the document. Topics whose probability falls below the model's minimum_probability threshold are omitted. You can visualize the distribution by creating a bar graph with matplotlib.

import matplotlib.pyplot as plt
import numpy as np

# request every topic's probability so the x axis is complete
# (low-probability topics are dropped from lda[doc_bow] by default)
doc_vec = lda.get_document_topics(doc_bow, minimum_probability=0)
x_axis = []
y_axis = []
for topic_id, prob in doc_vec:  # doc_vec is a list of (topic_id, probability) pairs
    x_axis.append(topic_id + 1)
    y_axis.append(prob)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(np.arange(2, len(x_axis), 2), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, np.max(y_axis) + .01])
plt.xlim([0, len(x_axis) + 1])
plt.savefig(output_path)  # output_path: wherever you want the figure saved
plt.close()

[bar chart of the topic distribution for the document]

If you want to see the top n terms in each topic, you can print them with the model's print_topics or show_topics methods. Referencing the graph, you can look up the top words of the high-probability topics and see how the model interpreted the document. You can also measure the distance between two documents' topic probability distributions with measures such as Hellinger distance, Euclidean distance, or Jensen-Shannon divergence.

Upvotes: 2
