Reputation: 161
I used Latent Dirichlet Allocation (scikit-learn implementation) to analyse about 500 scientific article abstracts, and I got topics containing the most important words (in German). My problem is interpreting the values associated with these most important words. I assumed I would get probabilities for all words per topic that add up to 1, which is not the case.
How can I interpret these values? For example, I would like to be able to tell why topic #20 has words with much higher values than the other topics. Does their absolute magnitude relate to Bayesian probability? Is the topic more common in the corpus? I am not yet able to reconcile these values with the math behind LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())
tf = tf_vectorizer.fit_transform(texts)

n_topics = 10
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)
lda.fit(tf)
def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % int(topic_id + 1))
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
                       + ' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))
n_top_words = 4
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Topic Nr.1: demenzforsch 1.31 | fotus 1.21 | umwelteinfluss 1.16 | forschungsergebnis 1.04 |
Topic Nr.2: fur 1.47 | zwisch 0.94 | uber 0.81 | kontext 0.8 |
...
Topic Nr.20: werd 405.12 | fur 399.62 | sozial 212.31 | beitrag 177.95 |
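A quick check (a minimal sketch assuming the fitted lda from the code above) shows what puzzles me: the rows of components_ are not normalized.

import numpy as np

# Each row of components_ belongs to one topic; the row sums differ from
# topic to topic and are much larger than 1, so they are not probabilities.
print(lda.components_.shape)        # (n_topics, n_features)
print(lda.components_.sum(axis=1))  # one sum per topic, none of them equal to 1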
Upvotes: 15
Views: 7609
Reputation: 933
From the documentation
components_ Variational parameters for topic word distribution. Since the complete conditional for topic word distribution is a Dirichlet, components_[i, j] can be viewed as pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as distribution over the words for each topic after normalization:
model.components_ / model.components_.sum(axis=1)[:, np.newaxis]
So the values can be seen as a distribution if you normalize each component, which lets you evaluate the importance of each term within the topic. As far as I understand, you cannot use the pseudo-counts to compare the importance of two topics in the corpus, as they include a smoothing factor applied to the term-topic distribution.
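A minimal sketch of both points, assuming the lda model and the document-term matrix tf from the question (lda.transform is the standard scikit-learn way to get the document-topic distribution):

import numpy as np

# Normalize each row of components_ so the pseudo-counts per topic sum to 1;
# each row is then a proper word distribution for that topic.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word.sum(axis=1))  # every row now sums to 1.0

# To compare how prevalent topics are in the corpus, look at the
# document-topic distribution rather than the raw pseudo-counts.
doc_topic = lda.transform(tf)           # shape: (n_documents, n_topics)
topic_share = doc_topic.mean(axis=0)    # average topic weight over all documents
for topic_id in topic_share.argsort()[::-1]:
    print('Topic Nr.%d: %.3f' % (topic_id + 1, topic_share[topic_id]))

A topic like Nr. 20 that accumulates large pseudo-counts will typically also carry a large share of the document-topic weight, but it is the normalized quantities that are comparable across topics.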
Upvotes: 6