Reputation: 33365
Gensim has a tutorial saying how to, given a document/query string, say what other documents are most similar to it, in descending order:
http://radimrehurek.com/gensim/tut3.html
It can also display what topics are associated with an entire model at all:
How to print the LDA topics models from gensim? Python
But how do you find what topics are associated with a given document/query string? Ideally with some numeric similarity metric for each topic? I haven't been able to find anything on that.
Upvotes: 0
Views: 1188
Reputation: 399
If you want to find the topic distribution of unseen documents then you need to convert the document of interest into a bag of words representation
from gensim import utils, models
from gensim.corpora import Dictionary
lda = models.LdaModel.load('saved_lda.model') # load saved model
dictionary = Dictionary.load('saved_dictionary.dict') # load saved dict
text = ' '
with open('document', 'r') as inp: # convert file to string
for line in inp:
text += line + ' '
tkn_doc = utils.simple_preprocess(text) # filter & tokenize words
doc_bow = dictionary.doc2bow(tkn_doc) # use dictionary to create bow
doc_vec = lda[doc_bow] # this is the topic probability distribution for the document of interest
From this code you get a sparse vector where the indices represent the topics 0....n and each 'weight' is the probability that the words in the document belong to that topic in the model. You can visualize the distribution by creating a bar graph using matplotlib.
y_axis = []
x_axis = []
for topic_id, dist in enumerate(doc_vec):
x_axis.append(topic_id + 1)
y_axis.append(dist)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(np.arange(2, len(x_axis), 2), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, np.max(y_axis) + .01])
plt.xlim([0, len(x_axis) + 1])
plt.savefig(output_path)
plt.close()
If you want to see the topn terms in each topic you can print them like this. Referencing the graph, you can look up the topn words you printed and determine how the document was interpreted by the model. You can also find distances between two different document probability distribution vectors by using vector calculations like hellinger distance, euclidean, jensen shannon etc.
Upvotes: 2