Reputation: 33
I created an LDA model for some text files using the gensim package in Python. I want to get the topic distributions of the learned model. Is there a method in gensim's LdaModel class, or some other way, to get the topic distributions from the model? For example, I use the coherence model to find the model with the best coherence value over numbers of topics in the range 1 to 5. After picking the best model, I use the get_document_topics method (thanks kenhbs) to get the topic distribution for the document that was used to create the model.
id2word = corpora.Dictionary([doc_terms])
bow = id2word.doc2bow(doc_terms)

max_coherence = -1
best_lda_model = None
for num_topics in range(1, 6):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=[bow], num_topics=num_topics, id2word=id2word)
    coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=[doc_terms], dictionary=id2word)
    coherence_value = coherence_model.get_coherence()
    if coherence_value > max_coherence:
        max_coherence = coherence_value
        best_lda_model = lda_model
The best model has 4 topics:
print(best_lda_model.num_topics)
4
But when I use get_document_topics, I get fewer than 4 values in the document's distribution:
topic_distrs = best_lda_model.get_document_topics(bow)
print(len(topic_distrs))
3
My question is: for the best LDA model with 4 topics (chosen via the coherence model), why does get_document_topics return fewer than 4 topics for the same document? And why do some topics have a very small probability (less than 1e-8)?
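The behavior can be reproduced without gensim: get_document_topics drops topics whose probability falls below minimum_probability (default 0.01, and gensim clamps the threshold to at least 1e-8), which is why fewer than num_topics pairs come back and why probabilities below 1e-8 never appear. A minimal pure-Python sketch of that filtering, with a made-up 4-topic distribution:

```python
def filter_topics(dist, minimum_probability=0.01):
    # Mimic gensim's filtering: the threshold is clamped to at least 1e-8,
    # so even minimum_probability=0 never returns topics below 1e-8.
    threshold = max(minimum_probability, 1e-8)
    return [(topic_id, prob) for topic_id, prob in dist if prob > threshold]

# Hypothetical full distribution for one document (made-up numbers).
full_dist = [(0, 0.62), (1, 0.30), (2, 0.08), (3, 1e-9)]

print(len(filter_topics(full_dist)))                         # 3 topics survive
print(len(filter_topics(full_dist, minimum_probability=0)))  # still 3: 1e-9 < 1e-8
```

This is why a 4-topic model can return only 3 pairs for a document: the fourth topic's probability sits below the (clamped) threshold.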
Upvotes: 2
Views: 5780
Reputation: 883
Just type:
import pandas as pd

pd.DataFrame(lda_model.get_document_topics(doc_term_matrix))
Upvotes: 1
Reputation: 869
You can play with the minimum_probability parameter and set it to a very small value such as 0.000001, or to 0.0 as below.
topic_vector = [x[1] for x in ldamodel.get_document_topics(new_doc_bow, minimum_probability=0.0, per_word_topics=False)]
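Note that even with minimum_probability=0.0, gensim clamps the threshold to 1e-8, so a topic can occasionally be missing from the output; filling a dense vector by topic id is therefore safer than relying on position. A pure-Python sketch (num_topics and sparse_dist are made-up stand-ins for the model's topic count and the output of get_document_topics):

```python
num_topics = 4
sparse_dist = [(0, 0.7), (2, 0.3)]  # hypothetical (topic_id, probability) pairs

# Fill a dense vector indexed by topic id; absent topics get probability 0.0.
topic_vector = [0.0] * num_topics
for topic_id, prob in sparse_dist:
    topic_vector[topic_id] = prob

print(topic_vector)  # [0.7, 0.0, 0.3, 0.0]
```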
Upvotes: 1
Reputation: 7164
From the documentation, you can use two methods for this.
If you are aiming to get the main terms in a specific topic, use get_topic_terms:
from gensim.models.ldamodel import LdaModel

K = 10
lda = LdaModel(some_corpus, num_topics=K)
lda.get_topic_terms(5, topn=10)

# Or for all topics
for i in range(K):
    lda.get_topic_terms(i, topn=10)
You can also print the entire underlying np.ndarray (called either beta or phi in standard LDA papers; its dimensions are (K, V) or (V, K)).
phi = lda.get_topics()
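As a sanity check on the convention: get_topics() returns an array of shape (num_topics, vocab_size), and each row is a probability distribution over the vocabulary that sums to 1. A sketch with a random normalized matrix standing in for phi (K and V are made-up sizes):

```python
import numpy as np

K, V = 4, 10  # made-up topic count and vocabulary size
rng = np.random.default_rng(0)
phi = rng.random((K, V))
phi /= phi.sum(axis=1, keepdims=True)  # normalize each row into a distribution

assert phi.shape == (K, V)
assert np.allclose(phi.sum(axis=1), 1.0)  # every row sums to 1
```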
Edit: From the link I included in the original answer: if you are looking for a document's topic distribution, use
res = lda.get_document_topics(bow)
As can be read from the documentation, the resulting object contains the following three lists:
list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.
list of (int, list of (int, float)), optional – Most probable topics per word. Each element in the list is a pair of a word’s id and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.
list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word’s id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.
Now, since per_word_topics was not set, res is itself the first of those lists, so
tops, probs = zip(*res)
probs
will contain the topic probabilities. By default, get_document_topics filters out topics whose probability is below minimum_probability (which gensim clamps to at least 1e-8), so you may get fewer than K pairs; pass minimum_probability=0 to get all K (for you, 4). The probabilities should sum to approximately 1.
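For reference, this unpacking works on any plain list of (topic_id, probability) pairs, which is what get_document_topics returns when per_word_topics is not set. A pure-Python sketch with made-up values:

```python
res = [(0, 0.25), (1, 0.40), (2, 0.30), (3, 0.05)]  # hypothetical output

# zip(*res) transposes the list of pairs into two tuples.
tops, probs = zip(*res)
print(tops)   # (0, 1, 2, 3)
print(probs)  # (0.25, 0.4, 0.3, 0.05) -- sums to 1 up to float rounding
```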
Upvotes: 3