Frederik De boeck
Frederik De boeck

Reputation: 33

What is the probability of a TERM for a specific TOPIC in Latent Dirichlet Allocation (LDA) in R

I'm working in R, package "topicmodels". I'm trying to work out and better understand the code/package. In most of the tutorials, documentation I'm reading I'm seeing people define topics by the 5 or 10 most probable terms. Here is an example:

    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")
    lda <- LDA(AssociatedPress[1:20,],  k = 5)
    topics(lda)
    terms(lda)
    terms(lda,5)

so the last part of the code returns me the 5 most probable terms associated with the 5 topics I've defined.

In the lda object, i can access the gamma element, which contains per document the probablity of beloning to each topic. So based on this I can extract the topics with a probability greater than any threshold I prefer, instead of having for everyone the same number of topics.

But my second step would then to know which words are strongest associated to the topics. I can use the terms(lda) function to pull this out, but this gives me the N so many.

In the output I've also found the

    lda@beta

which contains the beta per word per topic, but this is a Beta value, which I'm having a hard time interpreting. They are all negative values, and though I see some values around -6, and other around -200, i can't interpret this as a probability or a measure to see which words and how much stronger certain words associate to a topic. Is there a way to pull out/calculate anything that can be interpreted as such a measure.

many thanks Frederik

Upvotes: 3

Views: 1532

Answers (1)

Dietrich R-D
Dietrich R-D

Reputation: 21

The beta-matrix gives you a matrix with dimension #topics x #terms. The values are log-likelihoods, therefore you exp them. The given probabilities are of the type P(word|topic) and these probabilities only add up to 1 if you take the sum over the words but not over the topics P(all words|topic) = 1 and NOT P(word|all topics) = 1. What you are searching for is P(topic|word) but I actually do not know how to access or calculate it in this context. You will need P(word) and P(topic) I guess. P(topic) should be: colSums(lda@gamma)/sum(lda@gamma)

Becomes more obvious if you look at the gamma-matrix, which is #document x #topics. The given probabilites are P(topic|document) and can be interpreted as "what is the probability of topic x given document y". The sum over all topics should be 1 but not the sum over all documents.

Upvotes: 2

Related Questions